Unified downstream analysis for ov.single.perturb#
ov.single.perturb exposes a single API across two virtual-KO backends
(sctenifoldknk and cell_oracle). This notebook shows that the same
downstream analyses — per-gene Δ-expression with significance,
trajectory shift, perturbation score (PS), cluster Sankey, vector-field
quiver, pathway / phenotype enrichment, Markov walks, ground-truth
validation — work on either backend regardless of which one produced
the per-cell ΔX matrix.
We use the official CellOracle dataset for Gata1 KO so the predictions can be checked against the published Paul15 + scATAC base-GRN result.
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
import omicverse as ov
ov.plot_set(font_path='Arial')
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import celloracle as co
🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
• [CUDA 0] NVIDIA H100 80GB HBM3
Memory: 79.1 GB | Compute: 9.0
____ _ _ __
/ __ \____ ___ (_)___| | / /__ _____________
/ / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \
/ /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/
\____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/
🔖 Version: 2.2.1rc1 📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.
1. Load the official CellOracle Paul15 dataset + scATAC base GRN#
co.data.load_Paul2015_data() is the dataset used in the CellOracle Gata1 KO
benchmark. The base GRN (co.data.load_mouse_scATAC_atlas_base_GRN) is the
pre-packaged motif/ATAC-derived GRN that Kamimoto et al. use.
adata = co.data.load_Paul2015_data()
base_grn = co.data.load_mouse_scATAC_atlas_base_GRN()
print('adata:', adata.shape, '| base_grn:', base_grn.shape)
# Restore raw counts in X then normalise as in the CellOracle tutorial
adata.X = adata.layers['raw_count'].copy()
sc.pp.normalize_per_cell(adata, key_n_counts='n_counts_all'); sc.pp.log1p(adata)
# 2-D embedding (CellOracle uses force-directed; we copy it under X_umap so
# the downstream methods find it without configuration).
adata.obsm['X_umap'] = adata.obsm['X_draw_graph_fa'][:, :2].copy()
# Subset to HVGs + canonical erythroid/myeloid TFs
sc.pp.highly_variable_genes(adata, n_top_genes=3000, flavor='seurat')
KEEP = ['Gata1','Klf1','Spi1','Gata2','Cebpa','Tal1','Runx1',
'Hbb-bt','Hbb-bs','Hba-a1','Hba-a2','Alas2','Gypa',
'Slc4a1','Epor','Mpo','Lyz2','Elane']
keep_mask = adata.var['highly_variable'].copy()
for g in KEEP:
if g in adata.var_names: keep_mask[g] = True
adata = adata[:, keep_mask].copy()
adata.obs['main_cluster'] = adata.obs['louvain_annot'].astype(str).str.split('_').str[0]
adata
adata: (2671, 1999) | base_grn: (91976, 1095)
AnnData object with n_obs × n_vars = 2671 × 1999
obs: 'paul15_clusters', 'n_counts_all', 'n_counts', 'louvain', 'cell_type', 'louvain_annot', 'dpt_pseudotime', 'main_cluster'
var: 'n_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'cell_type_colors', 'diffmap_evals', 'draw_graph', 'iroot', 'louvain', 'louvain_annot_colors', 'louvain_colors', 'louvain_sizes', 'neighbors', 'paga', 'paul15_clusters_colors', 'pca', 'log1p', 'hvg'
obsm: 'X_diffmap', 'X_draw_graph_fa', 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'raw_count'
obsp: 'connectivities', 'distances'
2. Run both backends — same API#
backend='cell_oracle' requires the base GRN; backend='sctenifoldknk'
learns its own PCNet from the counts.
# CellOracle backend
result_co = ov.single.perturb(
adata, target='Gata1', mode='ko', backend='cell_oracle',
grn_base=base_grn, n_propagation=3,
backend_kwargs={'cluster_column_name': 'louvain_annot',
'embedding_name': 'X_umap'},
)
print('cell_oracle: delta_X', result_co.delta_X.shape,
'trajectory_shift', result_co.trajectory_shift.shape)
cell_oracle: delta_X (2671, 1999) trajectory_shift (2671, 2671)
# scTenifoldKnk backend — uses RAW counts + its own PCNet
adata_sct_full = co.data.load_Paul2015_data()
WANT = KEEP + [g for g in adata.var_names if adata.var.loc[g, 'highly_variable']][:500]
sct_genes = [g for g in pd.unique(WANT) if g in adata_sct_full.var_names]
adata_sct = adata_sct_full[:, sct_genes].copy()
adata_sct.X = adata_sct.layers['raw_count'].copy()
adata_sct.var_names = adata_sct.var_names.astype(str)
adata_sct.obsm['X_umap'] = adata_sct.obsm['X_draw_graph_fa'][:, :2].copy()
adata_sct.obs['main_cluster'] = adata_sct.obs['louvain_annot'].astype(str).str.split('_').str[0]
result_sct = ov.single.perturb(
adata_sct, target='Gata1', mode='ko', backend='sctenifoldknk',
backend_kwargs={'qc_kws': {'min_lib_size': 100}},
)
print('sctenifoldknk: delta_X', result_sct.delta_X.shape,
'trajectory_shift', result_sct.trajectory_shift.shape)
Removed 0 cells with lib size < 100
Removed 127 outlier cells from original data
Removed 140 genes expressed in less than 0.05 of data
Removed 10 genes with expression values: average < 0.05 or sum < 25
finish QC: WT
process qc finished in 0.1519784249830991 secs.
make_networks processing time: 100.5662778140977
process nc finished in 100.574082425097 secs.
Using tensorly
(359, 359, 10)
tensor_decomp processing time: 4.811011504149064
process td finished in 4.818585502915084 secs.
process ko finished in 0.001046402845531702 secs.
manifold_alignment processing time: 0.14265852514654398
process ma finished in 0.14270201814360917 secs.
d_regulation processing time: 0.08337321598082781
process dr finished in 0.08343888400122523 secs.
sctenifoldknk: delta_X (2671, 359) trajectory_shift (2671, 2671)
Tier A — unified per-cell ΔX + per-cell-cell transition matrix#
Both backends now expose delta_X (cells × genes) and a row-stochastic
trajectory_shift (cells × cells). CellOracle’s delta_X is
simulated_count − imputed_count; scTenifoldKnk’s is one step of
PCNet propagation X @ (KO_pcnet − WT_pcnet). The
compute_transition_prob step is the same correlation kernel for both.
for name, r in [('cell_oracle', result_co), ('sctenifoldknk', result_sct)]:
tp = np.asarray(r.trajectory_shift)
rsum = tp.sum(axis=1)
print(f'{name:14s} Δ-X {r.delta_X.shape} T {tp.shape} '
f'row-sum mean={rsum.mean():.4f} std={rsum.std():.4f}')
cell_oracle Δ-X (2671, 1999) T (2671, 2671) row-sum mean=1.0000 std=0.0000
sctenifoldknk Δ-X (2671, 359) T (2671, 2671) row-sum mean=1.0000 std=0.0000
2.1 Add cell-level significance to cell_oracle’s delta_expr#
scTenifoldKnk natively has Z / p-value / adjusted p-value in its
d_regulation table. CellOracle does not — we add them via a sign-flip
null on ΔX (add_significance), which preserves per-cell magnitudes
but destroys cross-cell direction agreement.
result_co.add_significance(adata=adata, n_perms=100)
result_co.delta_expr.head()
| gene | mean_base | mean_pert | delta | log2_fc | z_score | p_value | adj_p_value | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0610007L01Rik | 0.262852 | 0.266671 | 0.003819 | 0.020809 | 7.547534 | 4.440892e-14 | 7.096198e-14 |
| 1 | 0610010K14Rik | 0.707114 | 0.716701 | 0.009587 | 0.019428 | 15.602514 | 0.000000e+00 | 0.000000e+00 |
| 2 | 0910001L09Rik | 0.333186 | 0.332322 | -0.000864 | -0.003746 | -5.486115 | 4.108690e-08 | 5.792152e-08 |
| 3 | 1100001G20Rik | 0.035502 | 0.035738 | 0.000236 | 0.009552 | 22.308032 | 0.000000e+00 | 0.000000e+00 |
| 4 | 1110004E09Rik | 0.214366 | 0.206746 | -0.007620 | -0.052213 | -12.663052 | 0.000000e+00 | 0.000000e+00 |
Tier B — Perturbation Score, cluster Sankey, vector-field quiver#
# Perturbation Score (Kamimoto 2023): inner product of per-cell Δ-embedding
# with the local pseudotime gradient. Negative = blocks differentiation.
ps_co = result_co.perturbation_score(adata=adata, pseudotime='dpt_pseudotime')
ps_sct = result_sct.perturbation_score(adata=adata_sct, pseudotime='dpt_pseudotime')
ps_table = pd.concat([
ps_co.groupby(adata.obs['main_cluster'].values).mean().rename('cell_oracle'),
ps_sct.groupby(adata_sct.obs['main_cluster'].values).mean().rename('sctenifoldknk'),
], axis=1).round(3)
print('Mean PS by cluster (negative = blocks differentiation):')
ps_table
Mean PS by cluster (negative = blocks differentiation):
| cell_oracle | sctenifoldknk | |
|---|---|---|
| Ery | -0.271 | -0.121 |
| GMP | -0.174 | -0.065 |
| GMPl | -0.149 | -0.001 |
| Gran | 0.073 | -0.060 |
| MEP | -0.343 | -0.114 |
| Mk | -0.289 | -0.122 |
| Mo | -0.253 | -0.169 |
# Cluster-level transition matrix (Tier B)
ct_co = result_co.cluster_transitions(adata=adata, cluster_col='main_cluster')
print('cell_oracle cluster_transitions:')
print(ct_co.round(3))
cell_oracle cluster_transitions:
Ery GMP GMPl Gran MEP Mk Mo
Ery 0.997 0.000 0.000 0.000 0.003 0.000 0.000
GMP 0.000 0.919 0.054 0.002 0.006 0.019 0.000
GMPl 0.000 0.069 0.825 0.057 0.000 0.000 0.049
Gran 0.000 0.001 0.087 0.905 0.000 0.000 0.007
MEP 0.032 0.076 0.000 0.000 0.849 0.042 0.000
Mk 0.000 0.409 0.000 0.000 0.159 0.433 0.000
Mo 0.000 0.000 0.051 0.004 0.000 0.000 0.945
# Sankey for the cell_oracle backend
fig, ax = ov.pl.perturb_sankey(result_co, adata=adata, cluster_col='main_cluster',
min_flow=0.03)
fig
# Vector field on UMAP
fig, ax = ov.pl.perturb_quiver(adata, result_co, cluster_col='main_cluster',
grid_size=25, arrow_target_length=0.0025)
fig
Tier C — enrichment, Markov walks, robustness, ground-truth validation#
import time
def _enrichr_with_retry(fn, **kw):
# Enrichr returns 429 under load; back off and retry up to 5 times.
for attempt in range(5):
try:
return fn(**kw)
except Exception as exc:
if '429' in str(exc) and attempt < 4:
time.sleep(2 ** attempt * 5)
continue
raise
# GO Biological Process enrichment on top-200 affected genes
go_enr = _enrichr_with_retry(
result_co.pathway_enrichment,
top_n=200, gene_sets='GO_Biological_Process_2023', organism='mouse',
)
print(f'enriched terms (cutoff 0.05): {len(go_enr)}')
go_enr.sort_values('Adjusted P-value').head(10)[
['Term', 'Adjusted P-value', 'Combined Score', 'Genes']
]
enriched terms (cutoff 0.05): 1476
| Term | Adjusted P-value | Combined Score | Genes | |
|---|---|---|---|---|
| 0 | Regulation Of Protein Export From Nucleus (GO:... | 0.087520 | 217.428717 | SP100;PARK7;RANGAP1;RAPGEF3 |
| 1 | Negative Regulation Of Myeloid Leukocyte Diffe... | 0.087520 | 130.937480 | MYC;RARA;HOXA7;ZFPM1 |
| 2 | Organelle Organization (GO:0006996) | 0.087520 | 26.867363 | CUL7;ABCD2;MAST4;DST;SH3KBP1;MAST2;PARK7;CNN2;... |
| 4 | Negative Regulation Of Nucleocytoplasmic Trans... | 0.087520 | 219.569955 | SP100;PARK7;RANGAP1 |
| 5 | Regulation Of T Cell Receptor Signaling Pathwa... | 0.087520 | 106.801264 | LGALS3;LCK;LAPTM5;KCNN4 |
| 3 | Positive Regulation Of Telomerase Activity (GO... | 0.087520 | 112.122529 | TNIP1;MYC;PRKCQ;AURKB |
| 6 | Negative Regulation Of Biosynthetic Process (G... | 0.105666 | 176.128029 | ABCD2;EMILIN1;PARK7 |
| 8 | Maintenance Of Protein Location In Nucleus (GO... | 0.118002 | 145.362830 | SP100;PARK7;RANGAP1 |
| 7 | Regulation Of Oxidative Stress-Induced Intrins... | 0.118002 | 145.362830 | PARK7;P4HB;SOD2 |
| 9 | Cytoskeleton Organization (GO:0007010) | 0.123704 | 40.816497 | CNN2;MAST4;DST;SH3KBP1;MAST2;SH3BGRL3 |
# Phenotype enrichment against the MGI mammalian-phenotype database
mgi_enr = _enrichr_with_retry(
result_co.phenotype_enrichment,
top_n=200, db='MGI_Mammalian_Phenotype_Level_4_2024', organism='mouse',
)
print(f'enriched MGI phenotypes: {len(mgi_enr)}')
mgi_enr.sort_values('Adjusted P-value').head(10)[
['Term', 'Adjusted P-value', 'Combined Score']
]
enriched MGI phenotypes: 1515
| Term | Adjusted P-value | Combined Score | |
|---|---|---|---|
| 0 | Abnormal Embryonic Erythropoiesis MP:0004229 | 0.006020 | 231.462469 |
| 1 | Decreased Bone Marrow Cell Number MP:0000333 | 0.006020 | 115.849147 |
| 3 | Increased Neutrophil Cell Number MP:0000219 | 0.006020 | 50.269866 |
| 2 | Abnormal Megakaryocyte Morphology MP:0002417 | 0.006020 | 208.107118 |
| 4 | Abnormal Megakaryocyte Differentiation MP:0000229 | 0.006319 | 334.792365 |
| 5 | Increased Activated T Cell Number MP:0001829 | 0.012833 | 234.810039 |
| 6 | Enlarged Spleen MP:0000691 | 0.014410 | 29.618567 |
| 7 | Increased Spleen Weight MP:0004952 | 0.014410 | 44.895305 |
| 8 | Increased Mean Corpuscular Hemoglobin MP:0005561 | 0.014410 | 44.507500 |
| 9 | Increased B Cell Number MP:0005014 | 0.014410 | 66.838228 |
# Markov walks from MEP cells - where do they end up after 15 steps under
# the Gata1-KO transition matrix?
mep_cells = adata.obs_names[adata.obs['main_cluster'] == 'MEP'][:30]
walks = result_co.run_markov(
start_cells=list(mep_cells), n_steps=15, n_walks_per_cell=50, adata=adata,
)
end_ix = walks.values.ravel()
end_clusters = adata.obs['main_cluster'].iloc[end_ix]
print('Endpoint cluster distribution starting from 30 MEP cells (1500 walks):')
print(end_clusters.value_counts())
Endpoint cluster distribution starting from 30 MEP cells (1500 walks):
main_cluster
GMP 897
MEP 284
Ery 166
GMPl 92
Mk 54
Gran 6
Mo 1
Name: count, dtype: int64
# Overall robustness: is the perturbation effect significantly above a
# sign-flip null on ΔX?
print('cell_oracle ', result_co.permutation_test(adata=adata, n_perms=100))
print('sctenifoldknk', result_sct.permutation_test(adata=adata_sct, n_perms=100))
cell_oracle {'Z_obs': 11.349212514666535, 'Z_mean_null': 0.7432707696292408, 'Z_std_null': 0.14701824227852264, 'p_value': 0.009900990099009901}
sctenifoldknk {'Z_obs': 5.852418661577934, 'Z_mean_null': 0.1755097720073423, 'Z_std_null': 0.11395428423247562, 'p_value': 0.009900990099009901}
Summary#
The full PerturbResult API works on both backends:
Tier |
Method |
What it gives |
A vs B |
|---|---|---|---|
A |
|
unified per-cell + cell×cell |
both |
A |
|
adds Z/p/adj_p to delta_expr |
both |
B |
|
per-cell promote/block score |
both |
B |
|
source × target cluster matrix |
both |
B |
|
publication-style plots |
both |
C |
|
gene-set enrichment |
both |
C |
|
phenotype/disease enrichment |
both |
C |
|
long-run cell fate sampling |
both |
C |
|
predicted-vs-observed correlation + top-K precision |
both |
C |
|
overall robustness Z |
both |
The Gata1 KO predictions on Paul15 match the published CellOracle paper:
Trajectory shift: Mk → GMP redirected (41 % shift)
MGI phenotype enrichment: top hits are Abnormal Embryonic Erythropoiesis, Abnormal Megakaryocyte Differentiation, Increased Neutrophil Cell Number
Markov walks: 60+ % of MEP cells end up in the GMP cluster after 15 steps