单细胞RNA-seq数据的比对和分析

单细胞RNA-seq数据的比对和分析#

kallisto、bustools和kb-python程序是用于执行此分析的免费开源软件工具,它们可以从原始测序读数中产生基因表达定量。在本教程中,我们使用kallisto和bustools预处理了来自10X Genomics的pbmc_1k v3数据集,然后执行了基本分析。

我们在OmicVerse中改进了kallisto、bustools和kb-python程序的集成:

  • 更加用户友好的函数实现:我们自动将它们封装到omicverse.alignment类中。

如果您发现本教程有帮助,请引用kb-python和OmicVerse:

import omicverse as ovimport scanpy as scimport pandas as pdimport numpy as npov.plot_set(font_path='Arial')
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 8
    • [CUDA 0] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 1] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 2] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 3] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 4] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 5] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 6] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
    • [CUDA 7] NVIDIA GeForce RTX 4090 D
      Memory: 23.6 GB | Compute: 8.9
✅ plot_set complete.

下载人类参考文件并构建索引#

我们从Ensembl提供的人类基因组和注释中构建人类cDNA和内含子索引。

%%time!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
--2025-10-23 10:16:40--  ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
           => ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-108/fasta/homo_sapiens/dna ... done.
==> SIZE Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... 881211416
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... done.
Length: 881211416 (840M) (unauthoritative)

Homo_sapiens.GRCh38 100%[===================>] 840.39M   360KB/s    in 13m 48s 

2025-10-23 10:30:32 (1.02 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881211416]

--2025-10-23 10:30:32--  ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
           => ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-108/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.108.gtf.gz ... 54107597
==> PASV ... done.    ==> RETR Homo_sapiens.GRCh38.108.gtf.gz ... done.
Length: 54107597 (52M) (unauthoritative)

Homo_sapiens.GRCh38 100%[===================>]  51.60M  1.26MB/s    in 46s     

2025-10-23 10:31:22 (1.12 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’ saved [54107597]

CPU times: user 16.4 s, sys: 4.12 s, total: 20.5 s
Wall time: 14min 42s
# 输入文件路径
# 生成对齐索引
result = ov.alignment.single.ref(  
    fasta_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz',  # 输入  
    gtf_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz', # 输入  
    index_path='pbmc_1k_v3/index.idx', # 输出  
    t2g_path='pbmc_1k_v3/t2g.txt', # 输出  
    cdna_path='pbmc_1k_v3/cdna.fa', # 输出  
)
print(result.keys())
🚀 Starting ref workflow: standard
>> /opt/miniforge/envs/omicverse_working/bin/kb ref -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -t 8 --d-list-overhang 1 -f1 pbmc_1k_v3/cdna.fa pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2025-10-29 06:28:04,671]    INFO [ref] Preparing pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2025-10-29 06:29:09,670]    INFO [ref] Splitting genome pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp/tmpggg2bozs
[2025-10-29 06:30:43,695]    INFO [ref] Concatenating 1 cDNAs to pbmc_1k_v3/cdna.fa
[2025-10-29 06:30:48,774]    INFO [ref] Creating transcript-to-gene mapping at pbmc_1k_v3/t2g.txt
[2025-10-29 06:30:51,120]    INFO [ref] Indexing pbmc_1k_v3/cdna.fa to pbmc_1k_v3/index.idx
✓ ref workflow completed!
dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])

生成RNA计数矩阵#

以下命令将生成H5AD格式的RNA计数矩阵(行表示细胞,列表示基因),这是用于存储Anndata对象的二进制格式。注意,这需要向index_patht2g_path参数分别提供在前一步骤中下载的索引和转录本到基因的映射。此外,由于读取是使用10x Genomics Chromium单细胞v3化学试剂生成的,因此使用了10xv3技术参数。

# 生成RNA计数矩阵
result = ov.alignment.single.count(    
    fastq_paths=['pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz', # 输入                
                 'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz', # 输入                 
                 'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz', # 输入                 
                 'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz'], # 输入    
    index_path='pbmc_1k_v3/index.idx', # 输入    
    t2g_path='pbmc_1k_v3/t2g.txt', # 输入    
    technology='10XV3', # 技术    
    output_path='pbmc_1k_v3', # 输出    
    h5ad=True,    
    filter_barcodes=True,    
    threads=2
)
print(result.keys())
🚀 Starting count workflow: standard
    Technology: 10XV3
    Output directory: pbmc_1k_v3
>> /opt/miniforge/envs/omicverse_working/bin/kb count -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x 10XV3 -o pbmc_1k_v3 -t 2 -m 2G --filter bustools --h5ad pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2025-10-29 06:39:41,021]    INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to pbmc_1k_v3 from
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz
[2025-10-29 06:39:41,022]    INFO [count]         pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2025-10-29 06:54:09,339]    INFO [count] Sorting BUS file pbmc_1k_v3/output.bus to pbmc_1k_v3/tmp/output.s.bus
[2025-10-29 06:54:20,157]    INFO [count] On-list not provided
[2025-10-29 06:54:20,158]    INFO [count] Copying pre-packaged 10XV3 on-list to pbmc_1k_v3
[2025-10-29 06:54:21,859]    INFO [count] Inspecting BUS file pbmc_1k_v3/tmp/output.s.bus
[2025-10-29 06:54:29,973]    INFO [count] Correcting BUS records in pbmc_1k_v3/tmp/output.s.bus to pbmc_1k_v3/tmp/output.s.c.bus with on-list pbmc_1k_v3/10x_version3_whitelist.txt
[2025-10-29 06:54:44,096]    INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.s.c.bus to pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:54:52,412]    INFO [count] Generating count matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes from BUS file pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:55:00,914]    INFO [count] Writing gene names to file pbmc_1k_v3/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-10-29 06:55:01,271] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-10-29 06:55:01,314]    INFO [count] Reading matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes.mtx
[2025-10-29 06:55:05,177]    INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_unfiltered/adata.h5ad
[2025-10-29 06:55:05,880]    INFO [count] Filtering with bustools
[2025-10-29 06:55:05,880]    INFO [count] Generating on-list pbmc_1k_v3/filter_barcodes.txt from BUS file pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:55:07,084]    INFO [count] Correcting BUS records in pbmc_1k_v3/output.unfiltered.bus to pbmc_1k_v3/tmp/output.unfiltered.c.bus with on-list pbmc_1k_v3/filter_barcodes.txt
[2025-10-29 06:55:14,096]    INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.unfiltered.c.bus to pbmc_1k_v3/output.filtered.bus
[2025-10-29 06:55:21,810]    INFO [count] Generating count matrix pbmc_1k_v3/counts_filtered/cells_x_genes from BUS file pbmc_1k_v3/output.filtered.bus
[2025-10-29 06:55:28,532]    INFO [count] Writing gene names to file pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt
[2025-10-29 06:55:28,914] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-10-29 06:55:28,959]    INFO [count] Reading matrix pbmc_1k_v3/counts_filtered/cells_x_genes.mtx
[2025-10-29 06:55:31,769]    INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_filtered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])

分析#

在本教程的这一部分中,我们将把kb count生成的RNA计数矩阵加载到Python中,并使用omicverse管道对其进行分析。

您可以在此网站上找到有关这些代码的更多详细信息。

adata = sc.read_h5ad("pbmc_1k_v3/counts_filtered/adata.h5ad")var_names = pd.read_csv("pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt",                        index_col=[0],header=None)adata.var_names = var_names.index.tolist()adata
AnnData object with n_obs × n_vars = 1194 × 62703
%%timeadata=ov.pp.qc(adata,              tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},               doublets_method='scrublet',              batch_key=None)adata
🖥️ Using CPU mode for QC...

📊 Step 1: Calculating QC Metrics
   Mitochondrial genes (prefix 'MT-'): 37 found
   ✓ QC metrics calculated:
     • Mean nUMIs: 8081 (range: 845-59145)
     • Mean genes: 2222 (range: 21-6863)
     • Mean mitochondrial %: 13.7% (max: 98.6%)
   📈 Original cell count: 1,194

🔧 Step 2: Quality Filtering (SEURAT)
   Thresholds: mito≤0.2, nUMIs≥500, genes≥250
   📊 Seurat Filter Results:
     • nUMIs filter (≥500): 0 cells failed (0.0%)
     • Genes filter (≥250): 15 cells failed (1.3%)
     • Mitochondrial filter (≤0.2): 89 cells failed (7.5%)
   ✓ Filters applied successfully
   ✓ Combined QC filters: 89 cells removed (7.5%)

🎯 Step 3: Final Filtering
   Parameters: min_genes=200, min_cells=3
   Ratios: max_genes_ratio=1, max_cells_ratio=1
filtered out 43156 genes that are detected in less than 3 cells
   ✓ Final filtering: 0 cells, 43,156 genes removed

🔍 Step 4: Doublet Detection
   ⚠️  Note: 'scrublet' detection is too old and may not work properly
   💡 Consider using 'doublets_method=sccomposite' for better results
   🔍 Running scrublet doublet detection...

🔍 Running Scrublet Doublet Detection:
   Mode: cpu
   Computing doublet prediction using Scrublet algorithm
   🔍 Filtering genes and cells...
   🔍 Normalizing data and selecting highly variable genes...

🔍 Count Normalization:
   Target sum: median
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,105 cells × 19,547 genes
   ✓ Runtime: 0.00s

🔍 Highly Variable Genes Selection:
   Method: seurat
   ⚠️ Gene indices [3676] fell into a single bin: normalized dispersion set to 1
   💡 Consider decreasing `n_bins` to avoid this effect

✅ HVG Selection Completed Successfully!
   ✓ Selected: 2,865 highly variable genes out of 19,547 total (14.7%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'dispersions': Float vector (adata.var)
     • 'dispersions_norm': Float vector (adata.var)
   🔍 Simulating synthetic doublets...
   🔍 Normalizing observed and simulated data...

🔍 Count Normalization:
   Target sum: 1000000.0
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,105 cells × 2,865 genes
   ✓ Runtime: 0.00s

🔍 Count Normalization:
   Target sum: 1000000.0
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 2,210 cells × 2,865 genes
   ✓ Runtime: 0.01s
   🔍 Embedding transcriptomes using PCA...
   🔍 Calculating doublet scores...
    using data matrix X directly
   🔍 Calling doublets with threshold detection...
   📊 Automatic threshold: 0.192
   📈 Detected doublet rate: 1.0%
   🔍 Detectable doublet fraction: 51.4%
   📊 Overall doublet rate comparison:
     • Expected: 5.0%
     • Estimated: 1.9%

✅ Scrublet Analysis Completed Successfully!
   ✓ Results added to AnnData object:
     • 'doublet_score': Doublet scores (adata.obs)
     • 'predicted_doublet': Boolean predictions (adata.obs)
     • 'scrublet': Parameters and metadata (adata.uns)
   ✓ Scrublet completed: 11 doublets removed (1.0%)
CPU times: user 22.8 s, sys: 16.7 s, total: 39.5 s
Wall time: 3.98 s
AnnData object with n_obs × n_vars = 1094 × 19547
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU'
%%timeadata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,                       target_sum=50*1e4)adata
🔍 [2025-10-29 06:55:37] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 19547/19547 genes are kept.
    Among 19547 genes, 19547 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   ⚠️ Excluding 1 highly-expressed genes from normalization computation
   Excluded genes: ['IGKC']

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,094 cells × 19,547 genes
   ✓ Runtime: 0.16s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100

✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 2,000 highly variable genes out of 19,547 total (10.2%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 2.02 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson
CPU times: user 4.49 s, sys: 6.26 s, total: 10.8 s
Wall time: 2.08 s
AnnData object with n_obs × n_vars = 1094 × 19547
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
    layers: 'counts'
%%timeadata.raw = adataadata = adata[:, adata.var.highly_variable_features]adata
CPU times: user 4.04 ms, sys: 4.04 ms, total: 8.07 ms
Wall time: 7.41 ms
View of AnnData object with n_obs × n_vars = 1094 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
    layers: 'counts'
%%timeov.pp.scale(adata)adata
CPU times: user 1.46 s, sys: 4.36 s, total: 5.82 s
Wall time: 715 ms
AnnData object with n_obs × n_vars = 1094 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
    layers: 'counts', 'scaled'
%%timeov.pp.pca(adata,layer='scaled',n_pcs=50)adata
computing PCA🔍
    with n_comps=50
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
    finished✅ (0:00:00)
CPU times: user 5.97 s, sys: 23.4 s, total: 29.3 s
Wall time: 593 ms
AnnData object with n_obs × n_vars = 1094 × 2000
    obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
    var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
    obsm: 'X_pca', 'scaled|original|X_pca'
    varm: 'PCs', 'scaled|original|pca_loadings'
    layers: 'counts', 'scaled'
adata.obsm['X_pca']=adata.obsm['scaled|original|X_pca']ov.pl.embedding(adata,                  basis='X_pca',                  color='CST3',                  frameon='small')