单细胞RNA-seq数据的比对和分析#
kallisto、bustools和kb-python程序是用于执行此分析的免费开源软件工具,它们可以从原始测序读数中产生基因表达定量。在本教程中,我们使用kallisto和bustools预处理了来自10X Genomics的pbmc_1k v3数据集,然后执行了基本分析。
我们在OmicVerse中改进了kallisto、bustools和kb-python程序的集成:
更加用户友好的函数实现:我们自动将它们封装到
omicverse.alignment类中。
如果您发现本教程有帮助,请引用kb-python和OmicVerse:
Sullivan, D.K., Min, K.H.(., Hjörleifsson, K.E. et al. kallisto, bustools and kb-python for quantifying bulk, single-cell and single-nucleus RNA-seq. Nature Protocol (2025).https://doi.org/10.1038/s41596-024-01057-0
Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology (2021). https://doi.org/10.1038/s41587-021-00870-2
import omicverse as ovimport scanpy as scimport pandas as pdimport numpy as npov.plot_set(font_path='Arial')
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 8
• [CUDA 0] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 1] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 2] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 3] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 4] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 5] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 6] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
• [CUDA 7] NVIDIA GeForce RTX 4090 D
Memory: 23.6 GB | Compute: 8.9
✅ plot_set complete.
下载人类参考文件并构建索引#
我们从Ensembl提供的人类基因组和注释中构建人类cDNA和内含子索引。
%%time!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz!wget -P pbmc_1k_v3 ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
--2025-10-23 10:16:40-- ftp://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
=> ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-108/fasta/homo_sapiens/dna ... done.
==> SIZE Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... 881211416
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ... done.
Length: 881211416 (840M) (unauthoritative)
Homo_sapiens.GRCh38 100%[===================>] 840.39M 360KB/s in 13m 48s
2025-10-23 10:30:32 (1.02 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz’ saved [881211416]
--2025-10-23 10:30:32-- ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
=> ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-108/gtf/homo_sapiens ... done.
==> SIZE Homo_sapiens.GRCh38.108.gtf.gz ... 54107597
==> PASV ... done. ==> RETR Homo_sapiens.GRCh38.108.gtf.gz ... done.
Length: 54107597 (52M) (unauthoritative)
Homo_sapiens.GRCh38 100%[===================>] 51.60M 1.26MB/s in 46s
2025-10-23 10:31:22 (1.12 MB/s) - ‘pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz’ saved [54107597]
CPU times: user 16.4 s, sys: 4.12 s, total: 20.5 s
Wall time: 14min 42s
# 输入文件路径
# 生成对齐索引
result = ov.alignment.single.ref(
fasta_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', # 输入
gtf_paths='pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz', # 输入
index_path='pbmc_1k_v3/index.idx', # 输出
t2g_path='pbmc_1k_v3/t2g.txt', # 输出
cdna_path='pbmc_1k_v3/cdna.fa', # 输出
)
print(result.keys())
🚀 Starting ref workflow: standard
>> /opt/miniforge/envs/omicverse_working/bin/kb ref -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -t 8 --d-list-overhang 1 -f1 pbmc_1k_v3/cdna.fa pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2025-10-29 06:28:04,671] INFO [ref] Preparing pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, pbmc_1k_v3/Homo_sapiens.GRCh38.108.gtf.gz
[2025-10-29 06:29:09,670] INFO [ref] Splitting genome pbmc_1k_v3/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz into cDNA at /data/hulei/Projects/Omicverse_2/omicverse_update/tmp/tmpggg2bozs
[2025-10-29 06:30:43,695] INFO [ref] Concatenating 1 cDNAs to pbmc_1k_v3/cdna.fa
[2025-10-29 06:30:48,774] INFO [ref] Creating transcript-to-gene mapping at pbmc_1k_v3/t2g.txt
[2025-10-29 06:30:51,120] INFO [ref] Indexing pbmc_1k_v3/cdna.fa to pbmc_1k_v3/index.idx
✓ ref workflow completed!
dict_keys(['workflow', 'technology', 'parameters', 'index_path', 't2g_path', 'cdna_path'])
生成RNA计数矩阵#
以下命令将生成H5AD格式的RNA计数矩阵(行表示细胞,列表示基因),这是用于存储Anndata对象的二进制格式。注意,这需要向index_path和t2g_path参数分别提供在前一步骤中下载的索引和转录本到基因的映射。此外,由于读取是使用10x Genomics Chromium单细胞v3化学试剂生成的,因此使用了10xv3技术参数。
# 生成RNA计数矩阵
result = ov.alignment.single.count(
fastq_paths=['pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz', # 输入
'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz', # 输入
'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz', # 输入
'pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz'], # 输入
index_path='pbmc_1k_v3/index.idx', # 输入
t2g_path='pbmc_1k_v3/t2g.txt', # 输入
technology='10XV3', # 技术
output_path='pbmc_1k_v3', # 输出
h5ad=True,
filter_barcodes=True,
threads=2
)
print(result.keys())
🚀 Starting count workflow: standard
Technology: 10XV3
Output directory: pbmc_1k_v3
>> /opt/miniforge/envs/omicverse_working/bin/kb count -i pbmc_1k_v3/index.idx -g pbmc_1k_v3/t2g.txt -x 10XV3 -o pbmc_1k_v3 -t 2 -m 2G --filter bustools --h5ad pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2025-10-29 06:39:41,021] INFO [count] Using index pbmc_1k_v3/index.idx to generate BUS file to pbmc_1k_v3 from
[2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz
[2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz
[2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz
[2025-10-29 06:39:41,022] INFO [count] pbmc_1k_v3/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2025-10-29 06:54:09,339] INFO [count] Sorting BUS file pbmc_1k_v3/output.bus to pbmc_1k_v3/tmp/output.s.bus
[2025-10-29 06:54:20,157] INFO [count] On-list not provided
[2025-10-29 06:54:20,158] INFO [count] Copying pre-packaged 10XV3 on-list to pbmc_1k_v3
[2025-10-29 06:54:21,859] INFO [count] Inspecting BUS file pbmc_1k_v3/tmp/output.s.bus
[2025-10-29 06:54:29,973] INFO [count] Correcting BUS records in pbmc_1k_v3/tmp/output.s.bus to pbmc_1k_v3/tmp/output.s.c.bus with on-list pbmc_1k_v3/10x_version3_whitelist.txt
[2025-10-29 06:54:44,096] INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.s.c.bus to pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:54:52,412] INFO [count] Generating count matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes from BUS file pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:55:00,914] INFO [count] Writing gene names to file pbmc_1k_v3/counts_unfiltered/cells_x_genes.genes.names.txt
[2025-10-29 06:55:01,271] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-10-29 06:55:01,314] INFO [count] Reading matrix pbmc_1k_v3/counts_unfiltered/cells_x_genes.mtx
[2025-10-29 06:55:05,177] INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_unfiltered/adata.h5ad
[2025-10-29 06:55:05,880] INFO [count] Filtering with bustools
[2025-10-29 06:55:05,880] INFO [count] Generating on-list pbmc_1k_v3/filter_barcodes.txt from BUS file pbmc_1k_v3/output.unfiltered.bus
[2025-10-29 06:55:07,084] INFO [count] Correcting BUS records in pbmc_1k_v3/output.unfiltered.bus to pbmc_1k_v3/tmp/output.unfiltered.c.bus with on-list pbmc_1k_v3/filter_barcodes.txt
[2025-10-29 06:55:14,096] INFO [count] Sorting BUS file pbmc_1k_v3/tmp/output.unfiltered.c.bus to pbmc_1k_v3/output.filtered.bus
[2025-10-29 06:55:21,810] INFO [count] Generating count matrix pbmc_1k_v3/counts_filtered/cells_x_genes from BUS file pbmc_1k_v3/output.filtered.bus
[2025-10-29 06:55:28,532] INFO [count] Writing gene names to file pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt
[2025-10-29 06:55:28,914] WARNING [count] 22736 gene IDs do not have corresponding valid gene names. These genes will use their gene IDs instead.
[2025-10-29 06:55:28,959] INFO [count] Reading matrix pbmc_1k_v3/counts_filtered/cells_x_genes.mtx
[2025-10-29 06:55:31,769] INFO [count] Writing matrix to h5ad pbmc_1k_v3/counts_filtered/adata.h5ad
✓ count workflow completed!
dict_keys(['workflow', 'technology', 'output_path', 'parameters'])
分析#
在本教程的这一部分中,我们将把kb count生成的RNA计数矩阵加载到Python中,并使用omicverse管道对其进行分析。
您可以在此网站上找到有关这些代码的更多详细信息。
adata = sc.read_h5ad("pbmc_1k_v3/counts_filtered/adata.h5ad")var_names = pd.read_csv("pbmc_1k_v3/counts_filtered/cells_x_genes.genes.names.txt", index_col=[0],header=None)adata.var_names = var_names.index.tolist()adata
AnnData object with n_obs × n_vars = 1194 × 62703
%%timeadata=ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet', batch_key=None)adata
🖥️ Using CPU mode for QC...
📊 Step 1: Calculating QC Metrics
Mitochondrial genes (prefix 'MT-'): 37 found
✓ QC metrics calculated:
• Mean nUMIs: 8081 (range: 845-59145)
• Mean genes: 2222 (range: 21-6863)
• Mean mitochondrial %: 13.7% (max: 98.6%)
📈 Original cell count: 1,194
🔧 Step 2: Quality Filtering (SEURAT)
Thresholds: mito≤0.2, nUMIs≥500, genes≥250
📊 Seurat Filter Results:
• nUMIs filter (≥500): 0 cells failed (0.0%)
• Genes filter (≥250): 15 cells failed (1.3%)
• Mitochondrial filter (≤0.2): 89 cells failed (7.5%)
✓ Filters applied successfully
✓ Combined QC filters: 89 cells removed (7.5%)
🎯 Step 3: Final Filtering
Parameters: min_genes=200, min_cells=3
Ratios: max_genes_ratio=1, max_cells_ratio=1
filtered out 43156 genes that are detected in less than 3 cells
✓ Final filtering: 0 cells, 43,156 genes removed
🔍 Step 4: Doublet Detection
⚠️ Note: 'scrublet' detection is too old and may not work properly
💡 Consider using 'doublets_method=sccomposite' for better results
🔍 Running scrublet doublet detection...
🔍 Running Scrublet Doublet Detection:
Mode: cpu
Computing doublet prediction using Scrublet algorithm
🔍 Filtering genes and cells...
🔍 Normalizing data and selecting highly variable genes...
🔍 Count Normalization:
Target sum: median
Exclude highly expressed: False
✅ Count Normalization Completed Successfully!
✓ Processed: 1,105 cells × 19,547 genes
✓ Runtime: 0.00s
🔍 Highly Variable Genes Selection:
Method: seurat
⚠️ Gene indices [3676] fell into a single bin: normalized dispersion set to 1
💡 Consider decreasing `n_bins` to avoid this effect
✅ HVG Selection Completed Successfully!
✓ Selected: 2,865 highly variable genes out of 19,547 total (14.7%)
✓ Results added to AnnData object:
• 'highly_variable': Boolean vector (adata.var)
• 'means': Float vector (adata.var)
• 'dispersions': Float vector (adata.var)
• 'dispersions_norm': Float vector (adata.var)
🔍 Simulating synthetic doublets...
🔍 Normalizing observed and simulated data...
🔍 Count Normalization:
Target sum: 1000000.0
Exclude highly expressed: False
✅ Count Normalization Completed Successfully!
✓ Processed: 1,105 cells × 2,865 genes
✓ Runtime: 0.00s
🔍 Count Normalization:
Target sum: 1000000.0
Exclude highly expressed: False
✅ Count Normalization Completed Successfully!
✓ Processed: 2,210 cells × 2,865 genes
✓ Runtime: 0.01s
🔍 Embedding transcriptomes using PCA...
🔍 Calculating doublet scores...
using data matrix X directly
🔍 Calling doublets with threshold detection...
📊 Automatic threshold: 0.192
📈 Detected doublet rate: 1.0%
🔍 Detectable doublet fraction: 51.4%
📊 Overall doublet rate comparison:
• Expected: 5.0%
• Estimated: 1.9%
✅ Scrublet Analysis Completed Successfully!
✓ Results added to AnnData object:
• 'doublet_score': Doublet scores (adata.obs)
• 'predicted_doublet': Boolean predictions (adata.obs)
• 'scrublet': Parameters and metadata (adata.uns)
✓ Scrublet completed: 11 doublets removed (1.0%)
CPU times: user 22.8 s, sys: 16.7 s, total: 39.5 s
Wall time: 3.98 s
AnnData object with n_obs × n_vars = 1094 × 19547
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU'
%%timeadata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000, target_sum=50*1e4)adata
🔍 [2025-10-29 06:55:37] Running preprocessing in 'cpu' mode...
Begin robust gene identification
After filtration, 19547/19547 genes are kept.
Among 19547 genes, 19547 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
🔍 Count Normalization:
Target sum: 500000.0
Exclude highly expressed: True
Max fraction threshold: 0.2
⚠️ Excluding 1 highly-expressed genes from normalization computation
Excluded genes: ['IGKC']
✅ Count Normalization Completed Successfully!
✓ Processed: 1,094 cells × 19,547 genes
✓ Runtime: 0.16s
🔍 Highly Variable Genes Selection (Experimental):
Method: pearson_residuals
Target genes: 2,000
Theta (overdispersion): 100
✅ Experimental HVG Selection Completed Successfully!
✓ Selected: 2,000 highly variable genes out of 19,547 total (10.2%)
✓ Results added to AnnData object:
• 'highly_variable': Boolean vector (adata.var)
• 'highly_variable_rank': Float vector (adata.var)
• 'highly_variable_nbatches': Int vector (adata.var)
• 'highly_variable_intersection': Boolean vector (adata.var)
• 'means': Float vector (adata.var)
• 'variances': Float vector (adata.var)
• 'residual_variances': Float vector (adata.var)
Time to analyze data in cpu: 2.02 seconds.
✅ Preprocessing completed successfully.
Added:
'highly_variable_features', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
'counts', raw counts layer (adata.layers)
End of size normalization: shiftlog and HVGs selection pearson
CPU times: user 4.49 s, sys: 6.26 s, total: 10.8 s
Wall time: 2.08 s
AnnData object with n_obs × n_vars = 1094 × 19547
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts'
%%timeadata.raw = adataadata = adata[:, adata.var.highly_variable_features]adata
CPU times: user 4.04 ms, sys: 4.04 ms, total: 8.07 ms
Wall time: 7.41 ms
View of AnnData object with n_obs × n_vars = 1094 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts'
%%timeov.pp.scale(adata)adata
CPU times: user 1.46 s, sys: 4.36 s, total: 5.82 s
Wall time: 715 ms
AnnData object with n_obs × n_vars = 1094 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg'
layers: 'counts', 'scaled'
%%timeov.pp.pca(adata,layer='scaled',n_pcs=50)adata
computing PCA🔍
with n_comps=50
🖥️ Using sklearn PCA for CPU computation
🖥️ sklearn PCA backend: CPU computation
finished✅ (0:00:00)
CPU times: user 5.97 s, sys: 23.4 s, total: 29.3 s
Wall time: 593 ms
AnnData object with n_obs × n_vars = 1094 × 2000
obs: 'nUMIs', 'mito_perc', 'detected_genes', 'cell_complexity', 'passing_mt', 'passing_nUMIs', 'passing_ngenes', 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'mt', 'n_cells', 'percent_cells', 'robust', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'scrublet', 'status', 'status_args', 'REFERENCE_MANU', 'log1p', 'hvg', 'pca', 'scaled|original|pca_var_ratios', 'scaled|original|cum_sum_eigenvalues'
obsm: 'X_pca', 'scaled|original|X_pca'
varm: 'PCs', 'scaled|original|pca_loadings'
layers: 'counts', 'scaled'