使用kb-python进行Bulk RNA-seq比对#

本Notebook演示了在OmicVerse中使用kb-python(kallisto | bustools)无比对bulk RNA-seq工作流程,从SRA文件到**差异表达(DE)**和可视化。

流程概述

  1. 导入OmicVerse并设置绘图风格

  2. 下载SRA数据(直接.lite.1链接)

  3. 将SRA转换为双端FASTQ(parallel_fastq_dump

  4. 下载参考基因组/注释(Ensembl GRCh38)

  5. 构建kb参考(索引 + 转录本到基因映射)

  6. 使用kb-python定量reads(ov.alignment.count,technology=BULK

  7. 将样本合并为单一矩阵,通过ov.bulk.pyDEG运行DESeq2

  8. 可视化DEG(火山图)

重要提示 本Notebook不修改任何现有代码或输出;仅在提供的单元格周围添加教程风格的Markdown说明。

步骤0 — 导入OmicVerse#

我们首先导入OmicVerse并为下游图形设置一致的绘图风格。

%%time
import omicverse as ov
ov.style(font_path='Arial')
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] Tesla P40
      Memory: 22.4 GB | Compute: 6.1

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 1.7.9rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

CPU times: user 7.88 s, sys: 2.53 s, total: 10.4 s
Wall time: 15.3 s

步骤1 — 下载SRA输入文件(直接.lite.1链接)#

这里我们下载几个示例SRA run。在此工作流程中,我们使用以.lite.1结尾的NCBI SRA直接下载链接

重要提示

  • 如果文件已存在,ov.datasets.download_data(..., dir=...)将跳过下载。

  • 在许多环境中,您也可以通过ov.alignment.prefetch()下载,然后使用fastq-dump/fasterq-dump进行转换。

  • 如果您的网络阻止了直接SRA链接,请切换到prefetch或将文件镜像到可访问的位置。

links=[
    'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-922/SRR012/12544/SRR12544419/SRR12544419.lite.1',#no
    'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-924/SRR012/12544/SRR12544421/SRR12544421.lite.1',#no
    'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-922/SRR012/12544/SRR12544433/SRR12544433.lite.1'#yes
    'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos9/sra-pub-zq-922/SRR012/12544/SRR12544435/SRR12544435.lite.1'#yes
]
for link in links:
    ov.datasets.download_data(link,dir='./data')
🔍 Downloading data to ./data/SRR12544433.lite.1
✅ Download completed

步骤2 — 将SRA转换为双端FASTQ(单个示例)#

此单元格展示了使用ov.alignment.parallel_fastq_dump一个SRA文件转换为双端FASTQ

重要提示

  • 设置split_files=True以获得双端输出(*_1.fastq.gz*_2.fastq.gz)。

  • 将快速本地磁盘用于tmpdir以减少I/O开销。

  • 线程数控制转换速度;根据CPU配置选择合适的值。

ov.alignment.parallel_fastq_dump(
    sra_id='./data/SRR12544421.lite.1',
    threads=12,
    outdir='./data/SRR12544421',
    tmpdir='./tmp',
    split_files=True,
    gzip=True,
)
🚀 Starting parallel-fastq-dump for ./data/SRR12544421.lite.1
    Threads: 12
    Output directory: ./data/SRR12544421
    Added to PATH: /home/groups/xiaojie/steorra/env/omicverse/bin
>> /home/groups/xiaojie/steorra/env/omicverse/bin/parallel-fastq-dump --sra-id ./data/SRR12544421.lite.1 --threads 12 --outdir ./data/SRR12544421 --tmpdir ./tmp --minSpotId 1 --split-files --gzip
2026-01-29 23:56:27,879 - SRR ids: ['./data/SRR12544421.lite.1']
2026-01-29 23:56:27,879 - extra args: ['--split-files', '--gzip']
2026-01-29 23:56:27,880 - tempdir: ./tmp/pfd_0armgw5l
2026-01-29 23:56:27,880 - CMD: sra-stat --meta --quick ./data/SRR12544421.lite.1
2026-01-29 23:56:27,982 - ./data/SRR12544421.lite.1 spots: 10963094
2026-01-29 23:56:27,982 - blocks: [[1, 913591], [913592, 1827182], [1827183, 2740773], [2740774, 3654364], [3654365, 4567955], [4567956, 5481546], [5481547, 6395137], [6395138, 7308728], [7308729, 8222319], [8222320, 9135910], [9135911, 10049501], [10049502, 10963094]]
2026-01-29 23:56:27,982 - CMD: fastq-dump -N 1 -X 913591 -O ./tmp/pfd_0armgw5l/0 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,985 - CMD: fastq-dump -N 913592 -X 1827182 -O ./tmp/pfd_0armgw5l/1 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,986 - CMD: fastq-dump -N 1827183 -X 2740773 -O ./tmp/pfd_0armgw5l/2 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,987 - CMD: fastq-dump -N 2740774 -X 3654364 -O ./tmp/pfd_0armgw5l/3 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,989 - CMD: fastq-dump -N 3654365 -X 4567955 -O ./tmp/pfd_0armgw5l/4 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,991 - CMD: fastq-dump -N 4567956 -X 5481546 -O ./tmp/pfd_0armgw5l/5 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,992 - CMD: fastq-dump -N 5481547 -X 6395137 -O ./tmp/pfd_0armgw5l/6 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,994 - CMD: fastq-dump -N 6395138 -X 7308728 -O ./tmp/pfd_0armgw5l/7 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,996 - CMD: fastq-dump -N 7308729 -X 8222319 -O ./tmp/pfd_0armgw5l/8 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:27,998 - CMD: fastq-dump -N 8222320 -X 9135910 -O ./tmp/pfd_0armgw5l/9 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:28,001 - CMD: fastq-dump -N 9135911 -X 10049501 -O ./tmp/pfd_0armgw5l/10 --split-files --gzip ./data/SRR12544421.lite.1
2026-01-29 23:56:28,003 - CMD: fastq-dump -N 10049502 -X 10963094 -O ./tmp/pfd_0armgw5l/11 --split-files --gzip ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913593 spots for ./data/SRR12544421.lite.1
Written 913593 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
Read 913591 spots for ./data/SRR12544421.lite.1
Written 913591 spots for ./data/SRR12544421.lite.1
✓ parallel-fastq-dump completed successfully!
{'sra_id': './data/SRR12544421.lite.1',
 'threads': 12,
 'outdir': './data/SRR12544421',
 'split_files': True,
 'gzip': True}

步骤2b — 批量转换多个SRA文件#

此循环使用相同设置将多个SRA文件转换为双端FASTQ。

提示:如果有大量样本,建议使用作业调度器(如Slurm)跨节点并行处理,而不是在单个节点上无限增加线程数。

for sra in [
    'SRR12544433','SRR12544435'
]:
    ov.alignment.parallel_fastq_dump(
        sra_id=f'./data/{sra}.lite.1',
        threads=12,
        outdir=f'./data/{sra}',
        tmpdir='./tmp',
        split_files=True,
        gzip=True,
    )
🚀 Starting parallel-fastq-dump for ./data/SRR12544433.lite.1
    Threads: 12
    Output directory: ./data/SRR12544433
    Added to PATH: /home/groups/xiaojie/steorra/env/omicverse/bin
>> /home/groups/xiaojie/steorra/env/omicverse/bin/parallel-fastq-dump --sra-id ./data/SRR12544433.lite.1 --threads 12 --outdir ./data/SRR12544433 --tmpdir ./tmp --minSpotId 1 --split-files --gzip
2026-01-29 23:58:48,245 - SRR ids: ['./data/SRR12544433.lite.1']
2026-01-29 23:58:48,245 - extra args: ['--split-files', '--gzip']
2026-01-29 23:58:48,246 - tempdir: ./tmp/pfd_8tgtda45
2026-01-29 23:58:48,246 - CMD: sra-stat --meta --quick ./data/SRR12544433.lite.1
2026-01-29 23:58:48,358 - ./data/SRR12544433.lite.1 spots: 16602881
2026-01-29 23:58:48,358 - blocks: [[1, 1383573], [1383574, 2767146], [2767147, 4150719], [4150720, 5534292], [5534293, 6917865], [6917866, 8301438], [8301439, 9685011], [9685012, 11068584], [11068585, 12452157], [12452158, 13835730], [13835731, 15219303], [15219304, 16602881]]
2026-01-29 23:58:48,359 - CMD: fastq-dump -N 1 -X 1383573 -O ./tmp/pfd_8tgtda45/0 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,361 - CMD: fastq-dump -N 1383574 -X 2767146 -O ./tmp/pfd_8tgtda45/1 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,362 - CMD: fastq-dump -N 2767147 -X 4150719 -O ./tmp/pfd_8tgtda45/2 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,364 - CMD: fastq-dump -N 4150720 -X 5534292 -O ./tmp/pfd_8tgtda45/3 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,365 - CMD: fastq-dump -N 5534293 -X 6917865 -O ./tmp/pfd_8tgtda45/4 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,367 - CMD: fastq-dump -N 6917866 -X 8301438 -O ./tmp/pfd_8tgtda45/5 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,368 - CMD: fastq-dump -N 8301439 -X 9685011 -O ./tmp/pfd_8tgtda45/6 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,370 - CMD: fastq-dump -N 9685012 -X 11068584 -O ./tmp/pfd_8tgtda45/7 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,372 - CMD: fastq-dump -N 11068585 -X 12452157 -O ./tmp/pfd_8tgtda45/8 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,373 - CMD: fastq-dump -N 12452158 -X 13835730 -O ./tmp/pfd_8tgtda45/9 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,375 - CMD: fastq-dump -N 13835731 -X 15219303 -O ./tmp/pfd_8tgtda45/10 --split-files --gzip ./data/SRR12544433.lite.1
2026-01-29 23:58:48,377 - CMD: fastq-dump -N 15219304 -X 16602881 -O ./tmp/pfd_8tgtda45/11 --split-files --gzip ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383578 spots for ./data/SRR12544433.lite.1
Written 1383578 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
Read 1383573 spots for ./data/SRR12544433.lite.1
Written 1383573 spots for ./data/SRR12544433.lite.1
✓ parallel-fastq-dump completed successfully!
🚀 Starting parallel-fastq-dump for ./data/SRR12544435.lite.1
    Threads: 12
    Output directory: ./data/SRR12544435
    Added to PATH: /home/groups/xiaojie/steorra/env/omicverse/bin
>> /home/groups/xiaojie/steorra/env/omicverse/bin/parallel-fastq-dump --sra-id ./data/SRR12544435.lite.1 --threads 12 --outdir ./data/SRR12544435 --tmpdir ./tmp --minSpotId 1 --split-files --gzip
2026-01-29 23:59:23,325 - SRR ids: ['./data/SRR12544435.lite.1']
2026-01-29 23:59:23,325 - extra args: ['--split-files', '--gzip']
2026-01-29 23:59:23,326 - tempdir: ./tmp/pfd_t7s3okn4
2026-01-29 23:59:23,326 - CMD: sra-stat --meta --quick ./data/SRR12544435.lite.1
2026-01-29 23:59:23,433 - ./data/SRR12544435.lite.1 spots: 16486139
2026-01-29 23:59:23,433 - blocks: [[1, 1373844], [1373845, 2747688], [2747689, 4121532], [4121533, 5495376], [5495377, 6869220], [6869221, 8243064], [8243065, 9616908], [9616909, 10990752], [10990753, 12364596], [12364597, 13738440], [13738441, 15112284], [15112285, 16486139]]
2026-01-29 23:59:23,433 - CMD: fastq-dump -N 1 -X 1373844 -O ./tmp/pfd_t7s3okn4/0 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,435 - CMD: fastq-dump -N 1373845 -X 2747688 -O ./tmp/pfd_t7s3okn4/1 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,436 - CMD: fastq-dump -N 2747689 -X 4121532 -O ./tmp/pfd_t7s3okn4/2 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,438 - CMD: fastq-dump -N 4121533 -X 5495376 -O ./tmp/pfd_t7s3okn4/3 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,439 - CMD: fastq-dump -N 5495377 -X 6869220 -O ./tmp/pfd_t7s3okn4/4 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,441 - CMD: fastq-dump -N 6869221 -X 8243064 -O ./tmp/pfd_t7s3okn4/5 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,443 - CMD: fastq-dump -N 8243065 -X 9616908 -O ./tmp/pfd_t7s3okn4/6 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,445 - CMD: fastq-dump -N 9616909 -X 10990752 -O ./tmp/pfd_t7s3okn4/7 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,451 - CMD: fastq-dump -N 10990753 -X 12364596 -O ./tmp/pfd_t7s3okn4/8 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,453 - CMD: fastq-dump -N 12364597 -X 13738440 -O ./tmp/pfd_t7s3okn4/9 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,455 - CMD: fastq-dump -N 13738441 -X 15112284 -O ./tmp/pfd_t7s3okn4/10 --split-files --gzip ./data/SRR12544435.lite.1
2026-01-29 23:59:23,456 - CMD: fastq-dump -N 15112285 -X 16486139 -O ./tmp/pfd_t7s3okn4/11 --split-files --gzip ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373855 spots for ./data/SRR12544435.lite.1
Written 1373855 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
Read 1373844 spots for ./data/SRR12544435.lite.1
Written 1373844 spots for ./data/SRR12544435.lite.1
✓ parallel-fastq-dump completed successfully!

步骤3 — 下载参考基因组 + 注释#

我们从Ensembl下载GRCh38参考FASTA和GTF

重要提示

  • 保持FASTA和GTF来自相同的Ensembl版本以避免转录本ID不匹配。

  • 确保有足够的磁盘空间(人类参考文件较大)。

ov.datasets.download_data('ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz',
                         dir='./genomes')
ov.datasets.download_data('ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz',
                         dir='./genomes')
🔍 Downloading data to ./genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
✅ Download completed
🔍 Downloading data to ./genomes/Homo_sapiens.GRCh38.108.gtf.gz
✅ Download completed
'./genomes/Homo_sapiens.GRCh38.108.gtf.gz'

步骤4 — 构建kb-python参考(索引 + t2g + cDNA)#

ov.alignment.single.ref(...)准备kb-python参考资源:

  • index.idx:kallisto索引

  • t2g.txt:用于将计数聚合到基因的转录本到基因映射

  • cdna.fa:从注释派生的转录组FASTA

重要提示

  • 首次构建可能耗时较长;构建后可对所有样本复用相同的输出。

  • 将索引放在快速存储设备上以加快定量速度。

result = ov.alignment.single.ref(
    fasta_paths='genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz',  #input
    gtf_paths='genomes/Homo_sapiens.GRCh38.108.gtf.gz', #input
    index_path='pbmc_1k_v3/index.idx', #output
    t2g_path='pbmc_1k_v3/t2g.txt', #output
    cdna_path='pbmc_1k_v3/cdna.fa', #output
    temp_dir='tmp',
    overwrite=True,
)
print(result.keys())
🚀 Starting ref workflow: standard
    Using temporary directory: tmp-kb-e9513f2c23aa46d1961cde1d7afeba6b
>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb ref --tmp tmp-kb-e9513f2c23aa46d1961cde1d7afeba6b -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -t 8 --overwrite --d-list-overhang 1 -f1 pbmc_1k_v3/cdna.fa pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2026-01-29 22:21:41,856]    INFO [ref] Preparing pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2026-01-29 22:22:33,317]    INFO [ref] Splitting genome pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /scratch/users/steorra/analysis/26_omic_protocol/tmp-kb-e9513f2c23aa46d1961cde1d7afeba6b/tmpoyd3o7ta
[2026-01-29 22:23:30,907]    INFO [ref] Concatenating 1 cDNAs to pbmc_1k_v3/cdna.fa
[2026-01-29 22:23:31,551]    INFO [ref] Creating transcript-to-gene mapping at pbmc_1k_v3/t2g.txt
[2026-01-29 22:23:33,230]    INFO [ref] Indexing pbmc_1k_v3/cdna.fa to pbmc_1k_v3/index.idx
✓ ref workflow completed!
dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])

Step 5 — Quantify one sample with kb-python (technology = BULK)#

This is an example call for one sample.

CRITICAL

  • For bulk RNA-seq, each sample is treated as one “library” (no barcode filtering).

  • h5ad=True writes an AnnData object for convenient downstream analysis in OmicVerse/Scanpy.

  • If your environment does not expose count(...) directly, the equivalent OmicVerse entry point is typically ov.alignment.count(...) (the rest of the arguments stay the same).

result = count(
    fastq_paths=[
        "./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz", 
        "./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz",
    ],
    index_path="pbmc_1k_v3/index.idx",
    t2g_path="pbmc_1k_v3/t2g.txt",
    technology='BULK', # technology
    output_path="results/pbmc_test",
    h5ad=True,
    filter_barcodes=False,
    threads=12,
    parity="paired",          # 
    strand="unstranded",      # 
)
print(result.keys())
🚀 Starting count workflow: standard
    Technology: BULK
    Output directory: .
    Using temporary directory: tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2
>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2 -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o . -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz
[2026-01-29 22:53:46,113]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to . from
[2026-01-29 22:53:46,113]    INFO [count]         ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz
[2026-01-29 22:53:46,113]    INFO [count]         ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz
[2026-01-29 22:54:23,070]    INFO [count] Sorting BUS file ./output.bus to tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2/output.s.bus
[2026-01-29 22:54:25,476]    INFO [count] Inspecting BUS file tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2/output.s.bus
[2026-01-29 22:54:26,580]    INFO [count] Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file tmp-kb-6e8cfea103a042e79f4e83e5bd1facf2/output.s.bus
[2026-01-29 22:54:28,602]    INFO [count] Writing gene names to file ./counts_unfiltered/cells_x_genes.genes.names.txt
[2026-01-29 22:54:28,835] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2026-01-29 22:54:28,860]    INFO [count] Reading matrix ./counts_unfiltered/cells_x_genes.mtx
[2026-01-29 22:54:28,917]    INFO [count] Writing matrix to h5ad ./counts_unfiltered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])

步骤5b — 定量所有样本(循环)#

我们对每个样本运行kb-python定量,并将结果保存到每个样本的输出文件夹中。

提示:在实践中,建议为output_path设置清晰的命名规范(项目/样本/日期)以保持结果的可重现性。

for sra in [
    'SRR12544419','SRR12544421','SRR12544433','SRR12544435'
]:
    result = ov.alignment.count(
        fastq_paths=[
            f"./data/{sra}/{sra}.lite.1_1.fastq.gz", 
            f"./data/{sra}/{sra}.lite.1_2.fastq.gz",
        ],
        index_path="pbmc_1k_v3/index.idx",
        t2g_path="pbmc_1k_v3/t2g.txt",
        technology='BULK', # technology
        output_path=f"results/{sra}/",
        h5ad=True,
        filter_barcodes=False,
        threads=12,
        parity="paired",          # ✅关键
        strand="unstranded",      # ✅建议显式写
    )
    print(result.keys())
🚀 Starting count workflow: standard
    Technology: BULK
    Output directory: results/SRR12544419/
    Using temporary directory: tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc
>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544419/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz
[2026-01-30 00:03:00,630]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544419/ from
[2026-01-30 00:03:00,630]    INFO [count]         ./data/SRR12544419/SRR12544419.lite.1_1.fastq.gz
[2026-01-30 00:03:00,630]    INFO [count]         ./data/SRR12544419/SRR12544419.lite.1_2.fastq.gz
[2026-01-30 00:03:35,180]    INFO [count] Sorting BUS file results/SRR12544419/output.bus to tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc/output.s.bus
[2026-01-30 00:03:37,594]    INFO [count] Inspecting BUS file tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc/output.s.bus
[2026-01-30 00:03:38,702]    INFO [count] Generating count matrix results/SRR12544419/counts_unfiltered/cells_x_genes from BUS file tmp-kb-36a26e9c1a61444387dc9ef0c9f177dc/output.s.bus
[2026-01-30 00:03:41,325]    INFO [count] Writing gene names to file results/SRR12544419/counts_unfiltered/cells_x_genes.genes.names.txt
[2026-01-30 00:03:41,560] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2026-01-30 00:03:41,586]    INFO [count] Reading matrix results/SRR12544419/counts_unfiltered/cells_x_genes.mtx
[2026-01-30 00:03:41,643]    INFO [count] Writing matrix to h5ad results/SRR12544419/counts_unfiltered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])
🚀 Starting count workflow: standard
    Technology: BULK
    Output directory: results/SRR12544421/
    Using temporary directory: tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e
>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544421/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544421/SRR12544421.lite.1_1.fastq.gz ./data/SRR12544421/SRR12544421.lite.1_2.fastq.gz
[2026-01-30 00:03:53,957]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544421/ from
[2026-01-30 00:03:53,957]    INFO [count]         ./data/SRR12544421/SRR12544421.lite.1_1.fastq.gz
[2026-01-30 00:03:53,957]    INFO [count]         ./data/SRR12544421/SRR12544421.lite.1_2.fastq.gz
[2026-01-30 00:04:18,493]    INFO [count] Sorting BUS file results/SRR12544421/output.bus to tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e/output.s.bus
[2026-01-30 00:04:20,598]    INFO [count] Inspecting BUS file tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e/output.s.bus
[2026-01-30 00:04:21,702]    INFO [count] Generating count matrix results/SRR12544421/counts_unfiltered/cells_x_genes from BUS file tmp-kb-7dc3eb1f1a7f46369c65439c1df20d7e/output.s.bus
[2026-01-30 00:04:23,724]    INFO [count] Writing gene names to file results/SRR12544421/counts_unfiltered/cells_x_genes.genes.names.txt
[2026-01-30 00:04:23,956] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2026-01-30 00:04:23,981]    INFO [count] Reading matrix results/SRR12544421/counts_unfiltered/cells_x_genes.mtx
[2026-01-30 00:04:24,037]    INFO [count] Writing matrix to h5ad results/SRR12544421/counts_unfiltered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])
🚀 Starting count workflow: standard
    Technology: BULK
    Output directory: results/SRR12544433/
    Using temporary directory: tmp-kb-f36592f2f361482c8829daecfd59d0a4
>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-f36592f2f361482c8829daecfd59d0a4 -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544433/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544433/SRR12544433.lite.1_1.fastq.gz ./data/SRR12544433/SRR12544433.lite.1_2.fastq.gz
[2026-01-30 00:04:33,656]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544433/ from
[2026-01-30 00:04:33,656]    INFO [count]         ./data/SRR12544433/SRR12544433.lite.1_1.fastq.gz
[2026-01-30 00:04:33,656]    INFO [count]         ./data/SRR12544433/SRR12544433.lite.1_2.fastq.gz
[2026-01-30 00:05:04,901]    INFO [count] Sorting BUS file results/SRR12544433/output.bus to tmp-kb-f36592f2f361482c8829daecfd59d0a4/output.s.bus
[2026-01-30 00:05:07,209]    INFO [count] Inspecting BUS file tmp-kb-f36592f2f361482c8829daecfd59d0a4/output.s.bus
[2026-01-30 00:05:08,313]    INFO [count] Generating count matrix results/SRR12544433/counts_unfiltered/cells_x_genes from BUS file tmp-kb-f36592f2f361482c8829daecfd59d0a4/output.s.bus
[2026-01-30 00:05:10,436]    INFO [count] Writing gene names to file results/SRR12544433/counts_unfiltered/cells_x_genes.genes.names.txt
[2026-01-30 00:05:10,667] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2026-01-30 00:05:10,693]    INFO [count] Reading matrix results/SRR12544433/counts_unfiltered/cells_x_genes.mtx
[2026-01-30 00:05:10,749]    INFO [count] Writing matrix to h5ad results/SRR12544433/counts_unfiltered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])
🚀 Starting count workflow: standard
    Technology: BULK
    Output directory: results/SRR12544435/
    Using temporary directory: tmp-kb-7e11e84640b14489ad063057ba8b2738
>> /home/groups/xiaojie/steorra/env/omicverse/bin/kb count --tmp tmp-kb-7e11e84640b14489ad063057ba8b2738 -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x BULK -o results/SRR12544435/ -t 12 -m 2G --h5ad --parity paired --strand unstranded ./data/SRR12544435/SRR12544435.lite.1_1.fastq.gz ./data/SRR12544435/SRR12544435.lite.1_2.fastq.gz
[2026-01-30 00:05:21,328]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to results/SRR12544435/ from
[2026-01-30 00:05:21,328]    INFO [count]         ./data/SRR12544435/SRR12544435.lite.1_1.fastq.gz
[2026-01-30 00:05:21,328]    INFO [count]         ./data/SRR12544435/SRR12544435.lite.1_2.fastq.gz
[2026-01-30 00:05:52,473]    INFO [count] Sorting BUS file results/SRR12544435/output.bus to tmp-kb-7e11e84640b14489ad063057ba8b2738/output.s.bus
[2026-01-30 00:05:54,882]    INFO [count] Inspecting BUS file tmp-kb-7e11e84640b14489ad063057ba8b2738/output.s.bus
[2026-01-30 00:05:55,986]    INFO [count] Generating count matrix results/SRR12544435/counts_unfiltered/cells_x_genes from BUS file tmp-kb-7e11e84640b14489ad063057ba8b2738/output.s.bus
[2026-01-30 00:05:58,108]    INFO [count] Writing gene names to file results/SRR12544435/counts_unfiltered/cells_x_genes.genes.names.txt
[2026-01-30 00:05:58,341] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2026-01-30 00:05:58,366]    INFO [count] Reading matrix results/SRR12544435/counts_unfiltered/cells_x_genes.mtx
[2026-01-30 00:05:58,424]    INFO [count] Writing matrix to h5ad results/SRR12544435/counts_unfiltered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])

步骤6 — 加载每个样本的结果并统一基因标识符#

每次kb-python运行都会生成一个adata.h5ad文件及配套的基因名称文件。 我们加载每个样本的AnnData对象,并确保adata.var同时包含:

  • gene_name

  • gene_id

重要提示 不同流程中的基因命名规范可能不同;明确设置adata.var['gene_name']并将其用作索引,可使下游合并和可视化更加稳健。

ad_dict={}
for sra in [
    'SRR12544419','SRR12544421','SRR12544433','SRR12544435'
]:
    ad=ov.read(f'./results/{sra}/counts_unfiltered/adata.h5ad')
    gene_name=ov.pd.read_csv(
        f'./results/{sra}/counts_unfiltered/cells_x_genes.genes.names.txt',
        header=None
    )
    ad.var['gene_name']=gene_name[0].tolist()
    ad.var['gene_id']=ad.var.index
    ad.var.index=ad.var['gene_name']
    ad.var_names_make_unique()
    ad.obs['sra']=sra
    ad_dict[sra]=ad

Step 7 — Merge samples and define phenotype labels#

We concatenate all samples into one AnnData and create a Group column (e.g., disease vs healthy) for downstream DE.

CRITICAL For bulk RNA-seq, you typically want one observation per sample. In this demo workflow, each sample’s output is concatenated and then labeled via adata.obs['Group'].

adata=ov.concat(ad_dict)
adata.obs_names_make_unique()
adata.obs['Group']=['no','no','yes','yes']
adata
AnnData object with n_obs × n_vars = 4 × 62703
    obs: 'sra', 'Group'

步骤7b — 快速合理性检查(基因表达)#

对标记基因进行快速检查,有助于确认矩阵已正确加载,并在预期位置包含非零计数。

adata[:,'CD3D'].X.toarray()
array([[623.],
       [612.],
       [ 98.],
       [324.]])

步骤8 — 转换为DE分析用的计数矩阵#

ov.bulk.pyDEG期望一个基因 × 样本计数矩阵(pandas DataFrame)。 这里我们将AnnData转换为DataFrame并进行转置以符合该约定。

重要提示 DESeq2风格的方法需要原始整数计数。不要在DESeq2之前应用对数归一化。

data=adata.to_df().T
data.head()
barcode AAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAA-1 AAAAAAAAAAAAAAAA-2 AAAAAAAAAAAAAAAA-3
gene_name
ATAD3B 90.0 70.0 30.0 115.0
DDX11L17 4.0 0.0 0.0 0.0
ENSG00000228037.1 9.0 8.0 1.0 8.0
PRDM16 0.0 0.0 0.0 0.0
ENSG00000284616.1 0.0 0.0 0.0 0.0

步骤9 — 创建pyDEG对象#

我们将计数矩阵封装到OmicVerse的pyDEG对象中,用于差异表达分析。

dds=ov.bulk.pyDEG(data)

步骤10 — 差异表达(DESeq2)#

我们使用DESeq2后端运行DE分析。

重要提示

  • DESeq2期望原始整数计数,并在内部估算大小因子/离散度。

  • 这里的treatment_groupscontrol_groups指的是提供矩阵中的列/样本标识符。在您自己的数据中,请替换为适当的样本ID(通常是样本名称)。

dds.drop_duplicates_index()
print('... drop_duplicates_index success')
treatment_groups=['AAAAAAAAAAAAAAAA-2','AAAAAAAAAAAAAAAA-3']
control_groups=['AAAAAAAAAAAAAAAA','AAAAAAAAAAAAAAAA-1']
result=dds.deg_analysis(treatment_groups,control_groups,method='DEseq2')
... drop_duplicates_index success
⚙️ You are using DEseq2 method for differential expression analysis.
⏰ Start to create DeseqDataSet...
logres_prior=1.2473932639674563, sigma_prior=0.25
Log2 fold change & Wald test p-value: condition Treatment vs Control
                        baseMean  log2FoldChange     lfcSE      stat  \
gene_name                                                              
B2M                178491.812733        0.065096  0.418607  0.155506   
S100A9             170557.796728        0.378402  0.248436  1.523133   
S100A8             154015.888583        0.914504  0.455013  2.009840   
MT-RNR2            143254.671256        0.434606  0.553576  0.785089   
HBB                127090.703257        2.661912  1.449787  1.836071   
...                          ...             ...       ...       ...   
ENSG00000273937.1       0.000000             NaN       NaN       NaN   
ENSG00000278633.1       0.000000             NaN       NaN       NaN   
ENSG00000278066.1       0.000000             NaN       NaN       NaN   
ENSG00000277374.1       0.000000             NaN       NaN       NaN   
ENSG00000275661.1       0.000000             NaN       NaN       NaN   

                     pvalue      padj  
gene_name                              
B2M                0.876422  0.996897  
S100A9             0.127725  0.669610  
S100A8             0.044448  0.402458  
MT-RNR2            0.432402  0.948016  
HBB                0.066347  0.493618  
...                     ...       ...  
ENSG00000273937.1       NaN       NaN  
ENSG00000278633.1       NaN       NaN  
ENSG00000278066.1       NaN       NaN  
ENSG00000277374.1       NaN       NaN  
ENSG00000275661.1       NaN       NaN  

[62703 rows x 6 columns]
✅ Differential expression analysis completed.

步骤11 — 过滤低表达基因#

过滤可以去除信号不足的基因,提高可解释性和多重检验行为。

这里我们在计算DE后应用BaseMean阈值。

print(result.shape)
result=result.loc[result['log2(BaseMean)']>1]
print(result.shape)
(21002, 14)
(21002, 14)

步骤12 — 设置DEG阈值#

我们设置显著性和倍数变化的阈值,这些阈值控制DEG的判定和下游可视化。

# -1 means automatically calculates
dds.foldchange_set(fc_threshold=-1,
                   pval_threshold=0.05,
                   logp_max=10)
... Fold change threshold: 2.9757964388302165

步骤13 — 可视化DEG(火山图)#

火山图汇总了效应大小(log2FC)与显著性之间的关系。

提示:识别DEG后,典型的后续步骤包括:

  • 基因集富集分析(GO/KEGG/Reactome)

  • 通路可视化

  • PPI网络分析

(根据您的教程范围,可以将这些下游步骤作为附加部分添加。)

dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
                 plot_genes_num=8,plot_genes_fontsize=12,)
🌋 Volcano Plot Analysis:
   Total genes: 62703
   ↗️  Upregulated genes: 56
   ↘️  Downregulated genes: 39
   ➡️  Non-significant genes: 62608
   🎯 Total significant genes: 95
   log2FC range: -10.26 to 10.10
   qvalue range: 5.94e-14 to 1.00e+00

⚙️  Current Function Parameters:
   Data columns: pval_name='qvalue', fc_name='log2FC'
   Thresholds: pval_threshold=0.05, fc_max=2.9757964388302165, fc_min=-2.9757964388302165
   Plot size: figsize=(4, 4)
   Gene labels: plot_genes_num=8, plot_genes_fontsize=12
   Custom genes: None (auto-select top genes)

💡 Parameter Optimization Suggestions:
   ✅ Current parameters are optimal for your data!
────────────────────────────────────────────────────────────
<Axes: title={'center': 'DEG Analysis'}, xlabel='$log_{2}FC$', ylabel='$-log_{10}(qvalue)$'>
../_images/9fe030c172db1ef624bdb6a5fcc6afef5da84ae012261d764142e899d938fa52.png

扩展性与可重现性注意事项#

  • CPU和内存: kb-python通常比全基因组比对更轻量,但索引构建和定量仍受益于多线程和快速磁盘。

  • 参考一致性: 始终保持FASTA/GTF版本一致(并在教程中记录版本信息)。

  • 项目结构: 考虑将输出组织为results/<project>/<sample>/...,并保存描述组/重复的metadata.csv

如有需要,我可以按照相同的OmicVerse教程风格,通过添加新的Markdown + 新的代码单元格来扩展本Notebook,加入富集分析和PPI分析内容(不触及现有单元格)。