Removing ambient / contamination RNA from droplet scRNA-seq

Removing ambient / contamination RNA from droplet scRNA-seq#

Every droplet-based single-cell library is contaminated with ambient RNA — cell-free transcripts that leak out of stressed or lysed cells into the suspension and get co-encapsulated with real cells. This “soup” is captured and sequenced alongside each cell’s native transcriptome, so every cell barcode carries a background of transcripts it never expressed.

Left uncorrected, ambient RNA is not harmless noise:

it inflates marker genes in cell types that never expressed them — a T-cell barcode picks up haemoglobin from lysed erythrocytes, a neuron picks up a glial marker;
it distorts differential expression and cluster annotation, and produces phantom “doublet-like” or “low-quality” populations;
it scales with the soup concentration, so it is worst exactly in the samples that are hardest to dissociate (tumours, brain, frozen tissue).

Several methods estimate and subtract this background:

SoupX (Young & Behjati, GigaScience 2020) — reads the soup profile out of the empty droplets and subtracts a per-cell fraction.
DecontX (Yang et al., Genome Biology 2020) — a variational-EM model that needs a clustered matrix, no empty droplets required.
FastCAR (Berg et al. 2023) — deterministic per-gene subtraction from the empty-droplet ambient profile.
scCDC (Wang et al. 2024) — corrects only the Global-Contamination-causing Genes, deliberately conservative.
CellBender (Fleming et al., Nature Methods 2023) — a deep generative model; best overall in benchmarks but GPU-heavy.

ov.pp.ambient threads the four native, pip-installable backends (pysoupx, pydecontx, pyfastcar, pysccdc) behind a single remove_ambient dispatcher, and adds diagnostics that audit the correction.

Where this step belongs. Ambient removal operates on raw counts, so it runs after cell calling (you need the filtered cells, and for SoupX / FastCAR also the empty droplets) and before normalization, HVG selection and clustering. In this tutorial we cluster first only because DecontX / scCDC need cluster labels — in a real pipeline you would do a quick preliminary clustering, decontaminate, then re-run the full downstream analysis on the cleaned counts.

1. Load a real raw 10x dataset#

ov.datasets.pbmc_raw_10x() downloads the 10x Genomics pbmc_1k_v3 dataset (peripheral-blood mononuclear cells, Chromium 3’ v3, processed with Cell Ranger 3.0.0). With raw_droplets=True it returns two AnnData objects:

cells — the filtered real cells (1,222 cells x 15,300 genes, raw UMI counts in .X);
raw — the raw unfiltered droplet matrix (13,222 droplets = the 1,222 cells plus 12,000 empty droplets).

SoupX and FastCAR build their soup profile from those empty droplets, so they need the raw object; DecontX and scCDC work from the filtered, clustered cells alone.

import omicverse as ov
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt

ov.plot_set()

🔬 Starting plot initialization...
🧬 Detecting GPU devices…
🚫 No GPU devices found (CUDA/MPS/ROCm/XPU)

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.2.1rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

cells, raw = ov.datasets.pbmc_raw_10x(raw_droplets=True)
cells, raw

🔍 Downloading data to ./data/pbmc_raw_10x_cells.h5ad
⚠️ File ./data/pbmc_raw_10x_cells.h5ad already exists
🔍 Downloading data to ./data/pbmc_raw_10x_raw.h5ad
⚠️ File ./data/pbmc_raw_10x_raw.h5ad already exists

(AnnData object with n_obs × n_vars = 1222 × 15300
     obs: 'n_counts', 'n_genes'
     var: 'gene_ids', 'feature_types', 'genome'
     uns: 'dataset',
 AnnData object with n_obs × n_vars = 13222 × 15300
     obs: 'droplet_type', 'n_counts'
     var: 'gene_ids', 'feature_types', 'genome'
     uns: 'dataset')

# the raw matrix mixes real cells and empty droplets
raw.obs['droplet_type'].value_counts()

droplet_type
empty    12000
cell      1222
Name: count, dtype: int64

The barcode-rank plot#

The classic way to see the soup is the barcode-rank plot: total UMIs per droplet, sorted from highest to lowest on a log-log scale. Real cells sit on the high plateau; the long low tail is empty droplets that contain only ambient RNA — that tail is the soup profile SoupX and FastCAR estimate from.

counts = np.asarray(raw.X.sum(axis=1)).ravel()
order = np.argsort(counts)[::-1]
is_cell = (raw.obs['droplet_type'] == 'cell').to_numpy()[order]

fig, ax = plt.subplots(figsize=(5, 4))
rank = np.arange(1, len(counts) + 1)
ax.loglog(rank[is_cell], counts[order][is_cell], '.', ms=3,
          color='#C44E52', label='called cells')
ax.loglog(rank[~is_cell], counts[order][~is_cell], '.', ms=3,
          color='#4C72B0', label='empty droplets (soup)')
ax.set_xlabel('barcode rank')
ax.set_ylabel('total UMI counts')
ax.set_title('Barcode-rank plot — pbmc_1k_v3')
ax.legend(frameon=False)
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.show()

../_images/6a639cbc44288949e7c7d6769f71c3b7eece8e0b92e70058b98fd794496016a0.png

2. A quick clustering#

DecontX and scCDC need cluster labels, and SoupX uses them to auto-estimate the contamination fraction. We run a standard omicverse preprocessing + Leiden pipeline on a working copy, then transfer the labels and the UMAP back onto the raw-count cells object — ambient removal must operate on raw counts across all genes, not on the log-normalised HVG subset.

work = cells.copy()
work = ov.pp.preprocess(work, mode='shiftlog|pearson', n_HVGs=2000)
work.raw = work
work = work[:, work.var.highly_variable_features].copy()

🔍 [2026-05-22 00:17:46] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 15300/15300 genes are kept.
    Among 15300 genes, 15300 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   ⚠️ Excluding 5 highly-expressed genes from normalization computation
   Excluded genes: ['IGKC', 'HBB', 'MALAT1', 'MT-CO1', 'MT-CO3']

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,222 cells × 15,300 genes
   ✓ Runtime: 0.07s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100
✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 2,000 highly variable genes out of 15,300 total (13.1%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 1.31 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
│  Duration: 1.3926s                                                 │
│  Shape:    1,222 x 15,300 (Unchanged)                              │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● VAR    │ ✚ highly_variable (bool)                               │
│           │ ✚ highly_variable_features (bool)                      │
│           │ ✚ highly_variable_rank (float)                         │
│           │ ✚ means (float)                                        │
│           │ ✚ n_cells (int)                                        │
│           │ ✚ percent_cells (float)                                │
│           │ ✚ residual_variances (float)                           │
│           │ ✚ robust (bool)                                        │
│           │ ✚ variances (float)                                    │
│                                                                    │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ _ov_provenance                                       │
│           │ ✚ history_log                                          │
│           │ ✚ hvg                                                  │
│           │ ✚ log1p                                                │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ counts (sparse matrix, 1222x15300)                   │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

ov.pp.scale(work)
ov.pp.pca(work, layer='scaled', n_pcs=30)
ov.pp.neighbors(work, n_neighbors=15, n_pcs=30, use_rep='scaled|original|X_pca')
ov.pp.leiden(work, resolution=1.0)
ov.pp.umap(work)

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 0.0259s                                                 │
│  Shape:    1,222 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● LAYERS │ ✚ scaled (array, 1222x2000)                            │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ArrayView, shape: (1222, 2000), dtype: float64
🔧 PCA solver used: covariance_eigh
    finished✅ (5.87s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 5.8701s                                                 │
│  Shape:    1,222 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Tr...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 1222x30)                               │
│           │ ✚ scaled|original|X_pca (array, 1222x30)               │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU to calculate neighbors...

🔍 K-Nearest Neighbors Graph Construction:
   Mode: cpu
   Neighbors: 15
   Method: umap
   Metric: euclidean
   Representation: scaled|original|X_pca
   PCs used: 30
   🔍 Computing neighbor distances...
🔍 Computing connectivity matrix...
   💡 Using UMAP-style connectivity
✓ Graph is fully connected

✅ KNN Graph Construction Completed Successfully!
   ✓ Processed: 1,222 cells with 15 neighbors each
   ✓ Results added to AnnData object:
     • 'neighbors': Neighbors metadata (adata.uns)
     • 'distances': Distance matrix (adata.obsp)
     • 'connectivities': Connectivity matrix (adata.obsp)

╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
│  Duration: 11.3594s                                                │
│  Shape:    1,222 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ neighbors                                            │
│           │ └─ params: {'n_neighbors': 15, 'method': 'umap', 'random_s...│
│                                                                    │
│  ● OBSP   │ ✚ connectivities (sparse matrix, 1222x1222)            │
│           │ ✚ distances (sparse matrix, 1222x1222)                 │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU Leiden...
running Leiden clustering
    finished (0.06s)
    found 13 clusters and added
    'leiden', the cluster labels (adata.obs, categorical)

╭─ SUMMARY: leiden ──────────────────────────────────────────────────╮
│  Duration: 0.0586s                                                 │
│  Shape:    1,222 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● OBS    │ ✚ leiden (category)                                    │
│                                                                    │
│  ● UNS    │ ✚ leiden                                               │
│           │ └─ params: {'resolution': 1.0, 'random_state': 0, 'n_itera...│
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-05-22 00:18:04] Running UMAP in 'cpu' mode...
🖥️ Using Scanpy CPU UMAP...

🔍 UMAP Dimensionality Reduction:
   Mode: cpu
   Method: umap
   Components: 2
   Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'scaled|original|X_pca', 'n_pcs': 30}
   🔍 Computing UMAP parameters...
   🔍 Computing UMAP embedding (classic method)...
✅ UMAP Dimensionality Reduction Completed Successfully!
   ✓ Embedding shape: 1,222 cells × 2 dimensions
   ✓ Results added to AnnData object:
     • 'X_umap': UMAP coordinates (adata.obsm)
     • 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.

╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
│  Duration: 0.794s                                                  │
│  Shape:    1,222 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ umap                                                 │
│           │ └─ params: {'a': np.float64(0.5830300203414425), 'b': np.f...│
│                                                                    │
│  ● OBSM   │ ✚ X_umap (array, 1222x2)                               │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

# transfer the cluster labels + embedding onto the raw-count AnnData
cells.obs['leiden'] = work.obs['leiden'].values
cells.obsm['X_umap'] = work.obsm['X_umap']
cells.obs['leiden'].value_counts()

leiden
   224
   182
   125
   122
   108
    72
    69
    64
    63
    57
   54
   47
   35
Name: count, dtype: int64

3. Run `remove_ambient` with each native method#

ov.pp.ambient.remove_ambient is a single method= dispatcher. It writes the decontaminated counts back into .X, keeps the original counts in layers['ambient_raw'], stores the per-cell contamination fraction in obs['ambient_contamination'], and records method metadata in uns['ambient'].

SoupX / FastCAR need the raw unfiltered matrix — pass raw=.
DecontX / scCDC need cluster labels — pass cluster_key=.

We run all four on independent copies so we can compare them.

ad_soupx = cells.copy()
ov.pp.ambient.remove_ambient(ad_soupx, method='soupx', raw=raw,
                             cluster_key='leiden', verbose=True)

773 genes passed tf-idf cut-off and 103 soup quantile filter. Taking the top 100.
Using 952 independent estimates of rho.
Estimated global rho of 0.01
[ov.pp.ambient] method=soupx  mean contamination=0.0100  genes corrected=15300

AnnData object with n_obs × n_vars = 1222 × 15300
    obs: 'n_counts', 'n_genes', 'leiden', 'ambient_contamination'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'dataset', 'ambient'
    obsm: 'X_umap'
    layers: 'ambient_raw'

ad_decontx = cells.copy()
ov.pp.ambient.remove_ambient(ad_decontx, method='decontx',
                             cluster_key='leiden', verbose=True)

DecontX: analysing batch 'all_cells' (1222 cells)
iter 10 | converge: 0.03146
iter 20 | converge: 0.0201
iter 30 | converge: 0.009111
iter 40 | converge: 0.004174
iter 50 | converge: 0.002953
iter 60 | converge: 0.002275
iter 70 | converge: 0.001807
iter 80 | converge: 0.002192
iter 90 | converge: 0.00159
iter 100 | converge: 0.001244
iter 110 | converge: 0.001005
  iter 111 | converge: 0.001
[ov.pp.ambient] method=decontx  mean contamination=0.0856  genes corrected=15300

AnnData object with n_obs × n_vars = 1222 × 15300
    obs: 'n_counts', 'n_genes', 'leiden', 'ambient_contamination'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'dataset', 'ambient'
    obsm: 'X_umap'
    layers: 'ambient_raw'

ad_fastcar = cells.copy()
ov.pp.ambient.remove_ambient(ad_fastcar, method='fastcar', raw=raw,
                             verbose=True)

[ov.pp.ambient] method=fastcar  mean contamination=0.0941  genes corrected=247

AnnData object with n_obs × n_vars = 1222 × 15300
    obs: 'n_counts', 'n_genes', 'leiden', 'ambient_contamination'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'dataset', 'ambient'
    obsm: 'X_umap'
    layers: 'ambient_raw'

ad_sccdc = cells.copy()
ov.pp.ambient.remove_ambient(ad_sccdc, method='sccdc',
                             cluster_key='leiden', verbose=True)

[ov.pp.ambient] method=sccdc  mean contamination=0.1317  genes corrected=59

AnnData object with n_obs × n_vars = 1222 × 15300
    obs: 'n_counts', 'n_genes', 'leiden', 'ambient_contamination'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'dataset', 'ambient'
    obsm: 'X_umap'
    layers: 'ambient_raw'

Per-cell contamination fraction#

Each run populates obs['ambient_contamination']. The four methods make different modelling assumptions, so they will not agree on a single number — that disagreement is itself informative, and the diagnostics below let us judge the corrections rather than trust them blindly.

import pandas as pd
summary = pd.DataFrame({
    'soupx':   [ad_soupx.uns['ambient']['contamination_fraction']],
    'decontx': [ad_decontx.uns['ambient']['contamination_fraction']],
    'fastcar': [ad_fastcar.uns['ambient']['contamination_fraction']],
    'sccdc':   [ad_sccdc.uns['ambient']['contamination_fraction']],
}, index=['mean contamination fraction'])
summary.T

	mean contamination fraction
soupx	0.010000
decontx	0.085641
fastcar	0.094148
sccdc	0.131743

# distribution of the per-cell contamination fraction (DecontX run)
ov.pp.ambient.plot_contamination(ad_decontx)
plt.show()

../_images/93fa598953261aee42b4b28e6faadc78023eefe10f1ee50f69245a04647058ab.png

# contamination tends to track cell type — show it per Leiden cluster
ov.pp.ambient.plot_contamination(ad_decontx, groupby='leiden')
plt.show()

../_images/756a5f0d2ff9fbab75b917303d50c236260e78b67504736f35270d709459ca3b.png

# overlay the contamination fraction on the UMAP
ov.pl.embedding(ad_decontx, basis='X_umap', color='ambient_contamination',
                frameon='small', show=False)
plt.show()

../_images/f4a6beb7eb9262e151678eb1cfafd94eef7a1c9ae069346828f1778b4e169df4.png

Corrected vs raw counts#

The decontaminated counts are now in .X; the originals are preserved in layers['ambient_raw']. The total UMIs only ever go down — ambient removal subtracts, it never adds.

raw_tot = np.asarray(ad_decontx.layers['ambient_raw'].sum(axis=1)).ravel()
cor_tot = np.asarray(ad_decontx.X.sum(axis=1)).ravel()

fig, ax = plt.subplots(figsize=(4.2, 4.2))
ax.scatter(raw_tot, cor_tot, s=5, alpha=0.4, color='#4C72B0')
lim = [0, raw_tot.max() * 1.02]
ax.plot(lim, lim, '--', color='#C44E52', lw=1, label='no change')
ax.set_xlabel('raw total UMIs')
ax.set_ylabel('decontaminated total UMIs')
ax.set_title('DecontX — counts only decrease')
ax.legend(frameon=False)
ax.spines[['top', 'right']].set_visible(False)
plt.tight_layout()
plt.show()

../_images/91959ff862e1f877408e48f310ce374ee11d0e82d3fcfec8a15f64f873a7719d.png

4. Diagnostics#

ov.pp.ambient ships read-only audits that check whether a correction behaved sensibly — they do not modify the data.

4.1 Negative-marker check#

A correct decontamination should drive a cell-type-specific marker back toward zero in the cell types that biologically do not express it. Haemoglobin HBB is the textbook example: it is hugely expressed in erythrocytes/platelet contamination and leaks into every other barcode as soup. ambient_negative_marker_check compares the raw vs corrected mean of HBB across cell types and reports the fold reduction in the cell types that should be HBB-negative.

hbb = ov.pp.ambient.ambient_negative_marker_check(
    ad_decontx, 'HBB', 'leiden')
hbb[['n_cells', 'raw_mean', 'corrected_mean', 'fold_reduction']].round(3)

	n_cells	raw_mean	corrected_mean	fold_reduction
celltype
0	224	0.000	0.000	inf
2	125	0.000	0.000	inf
8	63	0.000	0.000	inf
1	182	0.049	0.047	1.057000e+00
7	64	0.016	0.000	7.470206e+16
11	47	0.000	0.000	inf
6	69	0.000	0.000	inf
9	57	0.000	0.000	inf
4	108	4.056	4.056	1.000000e+00
10	54	0.000	0.000	inf
5	72	0.000	0.000	inf
3	122	0.000	0.000	inf
12	35	0.000	0.000	inf

# the aggregate over the HBB-negative cell types
hbb.attrs['summary']

{'marker': 'HBB',
 'positive_celltypes': ['4'],
 'n_negative_celltypes': 12,
 'neg_raw_mean': 0.0054229625190297765,
 'neg_corrected_mean': 0.0038977974782859297,
 'neg_fold_reduction': 1.3912889392638592}

The positive cell type (highest raw HBB — the erythroid/platelet cluster) keeps its HBB, while the HBB-negative cell types see their residual haemoglobin signal pushed down. That is exactly the behaviour ambient removal should produce: soup stripped, biology preserved.

4.2 Count-integrity check#

A trustworthy correction only subtracts: no matrix entry may grow, and only the contaminated genes should change at all. count_integrity_check verifies this against the 2026 ambient-RNA benchmark count-integrity criterion.

integ = ov.pp.ambient.count_integrity_check(
    ad_decontx.layers['ambient_raw'], ad_decontx.X)
integ

{'passed': True,
 'n_entries': 18696600,
 'n_increased': 0,
 'max_increase': 0.0,
 'n_unchanged': 16203657,
 'frac_unchanged': 0.8666632970700555,
 'frac_changed': 0.13333670292994448,
 'total_raw': 9230791.0,
 'total_corrected': 8344633.546892991,
 'total_removed': 886157.4531070087,
 'removed_fraction': 0.09600016435287168}

4.3 Contamination report#

contamination_report reads the metadata remove_ambient wrote into obs / uns and returns a compact one-row summary per method.

report = pd.concat([
    ov.pp.ambient.contamination_report(ad_soupx),
    ov.pp.ambient.contamination_report(ad_decontx),
    ov.pp.ambient.contamination_report(ad_fastcar),
    ov.pp.ambient.contamination_report(ad_sccdc),
], ignore_index=True)
report

	method	n_cells	n_genes	genes_corrected	mean_contamination	median_contamination	min_contamination	max_contamination	integrity_passed	removed_fraction	frac_genes_changed
0	soupx	1222	15300	15300	0.010000	0.010000	0.010000	0.010000	True	0.010000	0.126952
1	decontx	1222	15300	15300	0.085641	0.007672	0.000039	0.998265	True	0.096000	0.133337
2	fastcar	1222	15300	247	0.094148	0.083521	0.009843	0.382253	True	0.070733	0.013277
3	sccdc	1222	15300	59	0.131743	0.103346	0.001969	0.802548	True	0.094104	0.003494

5. Ground-truth validation on a species-mixing dataset#

On real PBMCs we have no ground truth for the soup — we can only check that the correction behaves sensibly. The human-mouse species-mixing (“barnyard”) experiment removes that ambiguity.

ov.datasets.hgmm_mixture() loads the 10x hgmm_1k_v3 dataset: a 1:1 mix of human (HEK293T) and mouse (NIH3T3) cells, sequenced against a combined hg19 + mm10 reference. Because the two species share no genes, any transcript mapped to the other species inside a cell is unambiguous ambient contamination. The per-cell minor-species read fraction is therefore a direct, label-free measure of soup — and a real correction must reduce it.

hg_cells, hg_raw = ov.datasets.hgmm_mixture(raw_droplets=True)
hg_cells.obs['species'].value_counts()

🔍 Downloading data to ./data/hgmm_mixture_cells.h5ad
⚠️ File ./data/hgmm_mixture_cells.h5ad already exists
🔍 Downloading data to ./data/hgmm_mixture_raw.h5ad
⚠️ File ./data/hgmm_mixture_raw.h5ad already exists

species
mouse    544
human    495
mixed      7
Name: count, dtype: int64

We split the genes by genome (hg19 vs mm10) and define the cross-species fraction of a count matrix as the minor-species share of each cell’s reads — min(human_frac, mouse_frac). With the genome masks fixed, this is a short vectorised expression we reuse for the raw matrix and for each decontaminated matrix.

# genome masks — fixed once, the soup measurement reuses them
genome = hg_cells.var['genome'].to_numpy()
hg_mask = genome == 'hg19'
mm_mask = genome == 'mm10'
print(f'{hg_mask.sum()} human (hg19) genes, {mm_mask.sum()} mouse (mm10) genes')

16122 human (hg19) genes, 13862 mouse (mm10) genes

# ground-truth contamination of the RAW counts: minor-species read fraction
hg_umi = np.asarray(hg_cells.X[:, hg_mask].sum(axis=1)).ravel()
mm_umi = np.asarray(hg_cells.X[:, mm_mask].sum(axis=1)).ravel()
hg_frac = hg_umi / np.maximum(hg_umi + mm_umi, 1)
gt_raw = np.minimum(hg_frac, 1.0 - hg_frac)
print(f'ground-truth mean cross-species fraction: {gt_raw.mean():.4f}')
print(f'                            median:       {np.median(gt_raw):.4f}')

ground-truth mean cross-species fraction: 0.0075
                            median:       0.0045

Roughly 0.75 % of every cell’s reads come from the wrong species — that is the ambient contamination, measured without any model. Now we cluster the cells (the two species form cleanly separated clusters), run each backend, and recompute the cross-species fraction from the decontaminated counts.

hg_work = hg_cells.copy()
hg_work = ov.pp.preprocess(hg_work, mode='shiftlog|pearson', n_HVGs=2000)
hg_work.raw = hg_work
hg_work = hg_work[:, hg_work.var.highly_variable_features].copy()
ov.pp.scale(hg_work)
ov.pp.pca(hg_work, layer='scaled', n_pcs=30)
ov.pp.neighbors(hg_work, n_neighbors=15, n_pcs=30,
                use_rep='scaled|original|X_pca')
ov.pp.leiden(hg_work, resolution=0.5)
hg_cells.obs['leiden'] = hg_work.obs['leiden'].values

🔍 [2026-05-22 00:18:46] Running preprocessing in 'cpu' mode...
Begin robust gene identification
    After filtration, 29984/29984 genes are kept.
    Among 29984 genes, 29984 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson
🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
⚠️ Excluding 4 highly-expressed genes from normalization computation
   Excluded genes: ['hg19_MT-CO2', 'mm10_mt-Nd2', 'mm10_mt-Nd4', 'mm10_mt-Cytb']

✅ Count Normalization Completed Successfully!
   ✓ Processed: 1,046 cells × 29,984 genes
   ✓ Runtime: 0.11s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 2,000
   Theta (overdispersion): 100
✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 2,000 highly variable genes out of 29,984 total (6.7%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 0.97 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
│  Duration: 1.1255s                                                 │
│  Shape:    1,046 x 29,984 (Unchanged)                              │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● VAR    │ ✚ highly_variable (bool)                               │
│           │ ✚ highly_variable_features (bool)                      │
│           │ ✚ highly_variable_rank (float)                         │
│           │ ✚ means (float)                                        │
│           │ ✚ n_cells (int)                                        │
│           │ ✚ percent_cells (float)                                │
│           │ ✚ residual_variances (float)                           │
│           │ ✚ robust (bool)                                        │
│           │ ✚ variances (float)                                    │
│                                                                    │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ _ov_provenance                                       │
│           │ ✚ history_log                                          │
│           │ ✚ hvg                                                  │
│           │ ✚ log1p                                                │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ counts (sparse matrix, 1046x29984)                   │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 0.0273s                                                 │
│  Shape:    1,046 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● LAYERS │ ✚ scaled (array, 1046x2000)                            │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ArrayView, shape: (1046, 2000), dtype: float64
🔧 PCA solver used: covariance_eigh
    finished✅ (1.94s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 1.9419s                                                 │
│  Shape:    1,046 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Tr...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 1046x30)                               │
│           │ ✚ scaled|original|X_pca (array, 1046x30)               │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU to calculate neighbors...

🔍 K-Nearest Neighbors Graph Construction:
   Mode: cpu
   Neighbors: 15
   Method: umap
   Metric: euclidean
   Representation: scaled|original|X_pca
   PCs used: 30
   🔍 Computing neighbor distances...
   🔍 Computing connectivity matrix...
   💡 Using UMAP-style connectivity
✓ Graph is fully connected

✅ KNN Graph Construction Completed Successfully!
   ✓ Processed: 1,046 cells with 15 neighbors each
   ✓ Results added to AnnData object:
     • 'neighbors': Neighbors metadata (adata.uns)
     • 'distances': Distance matrix (adata.obsp)
     • 'connectivities': Connectivity matrix (adata.obsp)

╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
│  Duration: 0.2066s                                                 │
│  Shape:    1,046 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ neighbors                                            │
│           │ └─ params: {'n_neighbors': 15, 'method': 'umap', 'random_s...│
│                                                                    │
│  ● OBSP   │ ✚ connectivities (sparse matrix, 1046x1046)            │
│           │ ✚ distances (sparse matrix, 1046x1046)                 │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using Scanpy CPU Leiden...
running Leiden clustering
    finished (0.04s)
    found 6 clusters and added
    'leiden', the cluster labels (adata.obs, categorical)

╭─ SUMMARY: leiden ──────────────────────────────────────────────────╮
│  Duration: 0.0386s                                                 │
│  Shape:    1,046 x 2,000 (Unchanged)                               │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● OBS    │ ✚ leiden (category)                                    │
│                                                                    │
│  ● UNS    │ ✚ leiden                                               │
│           │ └─ params: {'resolution': 0.5, 'random_state': 0, 'n_itera...│
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

# decontaminate with each native method and measure the residual soup
gt = {'raw (no correction)': gt_raw.mean()}
for method in ['soupx', 'decontx', 'fastcar', 'sccdc']:
    a = hg_cells.copy()
    if method in ('soupx', 'fastcar'):
        ov.pp.ambient.remove_ambient(a, method=method, raw=hg_raw,
                                     cluster_key='leiden')
    else:
        ov.pp.ambient.remove_ambient(a, method=method,
                                     cluster_key='leiden')
    hg_c = np.asarray(a.X[:, hg_mask].sum(axis=1)).ravel()
    mm_c = np.asarray(a.X[:, mm_mask].sum(axis=1)).ravel()
    f = hg_c / np.maximum(hg_c + mm_c, 1)
    gt[method] = float(np.minimum(f, 1.0 - f).mean())

gt_df = pd.DataFrame({'mean cross-species fraction': gt})
gt_df['reduction vs raw'] = 1.0 - gt_df['mean cross-species fraction'] / gt['raw (no correction)']
gt_df.round(4)

	mean cross-species fraction	reduction vs raw
raw (no correction)	0.0075	0.0000
soupx	0.0042	0.4452
decontx	0.0047	0.3797
fastcar	0.0042	0.4346
sccdc	0.0074	0.0181

fig, ax = plt.subplots(figsize=(5.5, 4.2), constrained_layout=True)
colors = ["#7f7f7f", "#4C72B0", "#55A868", "#C44E52", "#8172B3"]
ax.bar(gt_df.index, gt_df["mean cross-species fraction"], color=colors)
ax.axhline(gt["raw (no correction)"], ls="--", lw=1, color="#7f7f7f")
ax.set_ylabel("mean cross-species read fraction")
ax.set_title("hgmm species-mixing — ground-truth contamination", fontsize=11)
for tick in ax.get_xticklabels():
    tick.set_rotation(30)
    tick.set_ha("right")
ax.spines[["top", "right"]].set_visible(False)
plt.show()

../_images/a732fc06ae770bfafb1a790917868dc561f2818ca1f5b883886c3f8791aa6b23.png

This is the real test: the cross-species read fraction is a quantity no method can see during fitting, yet every backend reduces it. The empty-droplet methods (SoupX, FastCAR) and the EM model (DecontX) all strip a substantial share of the wrong-species reads; scCDC, which by design only touches Global-Contamination-causing Genes, moves it least — the expected behaviour of a deliberately conservative method.

6. Choosing a method#

The four native backends — plus the deep-learning options — make different trade-offs. The research consensus, and the behaviour seen above:

Method	Needs	Strengths	Watch out for
SoupX	raw matrix + empty droplets; clusters for auto-rho	gentle, interpretable, fast; reads the soup directly from empty droplets	conservative — under-corrects clean datasets
DecontX	filtered, clustered matrix	no empty droplets required; per-cell variational estimate	can over-correct if clustering is poor
FastCAR	raw matrix + empty droplets	deterministic per-gene subtraction; very fast	tuned by hard cutoffs (`empty_droplet_cutoff`)
scCDC	filtered, clustered matrix	anti-over-correction — only fixes contamination-causing genes	corrects few genes; leaves most soup if it is diffuse
CellBender	raw 10x h5 (GPU)	best overall in benchmarks; jointly calls cells + removes background	heavyweight, GPU, run via its own CLI
scAR	raw matrix (GPU)	deep ambient denoising	heavyweight, GPU

Practical guidance:

Have empty droplets and want a safe default — SoupX. It rarely hurts and is easy to reason about.
Only have a filtered matrix (a public h5ad, no raw droplets) — DecontX or scCDC.
Worried about over-correction of biologically real markers — scCDC.
Compute budget and a GPU — CellBender generally wins benchmarks, but is heavyweight; ov.pp.ambient keeps its wrapper thin and points you at CellBender’s own remove-background CLI, then load the result with ov.read.

Whatever you pick, audit it: ambient_negative_marker_check, count_integrity_check and — when you have one — a ground-truth signal like the cross-species fraction. Then continue the normal workflow (normalize, HVG, cluster, annotate) on the decontaminated .X.