Single-extracellular-vesicle (single-EV) proteomics

Extracellular vesicles (EVs), chiefly exosomes and microvesicles, are nanoscale membrane particles that every cell releases into its surroundings. They carry a surface-protein cargo that reflects their cell of origin, which makes them intensively studied biomarker candidates for liquid biopsy. Until recently EV proteomics was almost always bulk: a preparation of millions of vesicles was lysed and measured together, giving one averaged protein profile. That average hides the central fact about EVs: a preparation is a mixture of vesicle subpopulations, and a bulk measurement can never tell you whether two markers are on the same vesicle or merely in the same tube.

Single-EV proteomics measures the protein content of individual vesicles. The natural data structure is an EV x protein matrix — each row one vesicle, each column one protein/marker target — which is structurally identical to a single-cell cell x gene matrix (a vesicle plays the role of a cell, a protein marker the role of a gene). The entire single-cell analysis stack therefore transfers: QC, normalization, dimensionality reduction, subpopulation clustering, marker discovery and differential analysis.
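
To make the "same vesicle versus same tube" distinction concrete, here is a minimal sketch on a made-up 6-EV x 2-marker matrix. The marker names and counts are illustrative only and are not taken from the PBA data. The pooled profile sees both markers, while the per-EV rows show that they never occur on the same vesicle.

```python
import numpy as np

# Toy EV x protein count matrix: each row is one vesicle, each column one marker.
# Marker names and counts are invented for illustration only.
markers = ["CD9", "CD63"]
X = np.array([
    [3, 0],
    [2, 0],
    [4, 0],
    [0, 5],
    [0, 1],
    [0, 2],
])

# Bulk view: pool every vesicle into a single profile for the tube.
bulk_profile = X.sum(axis=0)

# Single-EV view: count vesicles on which both markers are detected together.
detected = X > 0
n_double_positive = int((detected[:, 0] & detected[:, 1]).sum())

print("bulk profile:", dict(zip(markers, bulk_profile.tolist())))
print("EVs positive for both markers:", n_double_positive)
```

Both markers have substantial pooled signal, yet no vesicle carries the two together. Only the row-wise matrix can show that.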

ov.single.ev is the omicverse module for this modality. It implements the full single-EV pipeline behind one API and supports three measurement value types — sequencing counts, imaging/flow intensity and digital binary calls — so the same functions work whatever the platform.
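
The three value types can be told apart from the matrix itself. The helper below is a hypothetical sketch, not part of omicverse: `guess_value_type` and the labels `'intensity'` and `'binary'` are assumed names (only `'count'` appears in this notebook's output), and the rules are deliberately simple.

```python
import numpy as np

def guess_value_type(X):
    """Hypothetical helper: guess the measurement type of an EV x protein matrix."""
    X = np.asarray(X, dtype=float)
    if np.isin(X, (0.0, 1.0)).all():
        return "binary"        # digital present/absent calls
    if (X >= 0).all() and np.allclose(X, np.round(X)):
        return "count"         # non-negative integers, e.g. sequencing reads
    return "intensity"         # continuous imaging/flow signal

print(guess_value_type([[0, 1], [1, 0]]))
print(guess_value_type([[0, 3], [2, 0]]))
print(guess_value_type([[0.5, 12.3], [1.7, 0.0]]))
```

A real pipeline would normally take the value type from the platform metadata rather than infer it, since a shallow count matrix can consist entirely of zeros and ones.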

This notebook runs the comprehensive pipeline on real sequencing-count data: the Proximity Barcoding Assay (PBA) of Wu et al., Nat Commun 2019 (10:3854; PMID 31477692), which barcodes individual exosomes and reads out their surface proteins by next-generation sequencing. Analysis follows the MISEV2023 minimal-information framework (Welsh et al., J Extracell Vesicles 2024).

1. Load the data and inspect it#

ov.datasets.ev_pba() downloads the real PBA dataset — a curated 75,000-EV x 40-surface-protein tutorial subset spanning 15 samples: 13 cancer/normal cell-line exosome populations and 2 human-serum exosome samples. Each row is one individual exosome (a PBA complex identified by its barcode); each value is a sequencing read count for one surface-protein antibody. The measurement value type is recorded in uns['ev']['value_type'] — here 'count'.

import omicverse as ov
import matplotlib.pyplot as plt

ov.plot_set(font_path='Arial')
adata = ov.datasets.ev_pba()
adata

🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
🚫 No GPU devices found (CUDA/MPS/ROCm/XPU)

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 2.2.1rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

🔍 Downloading data to ./data/ev_pba.h5ad
⚠️ File ./data/ev_pba.h5ad already exists

AnnData object with n_obs × n_vars = 75000 × 40
    obs: 'sample', 'source', 'sample_type', 'condition', 'complex_tag', 'total_counts', 'n_proteins'
    uns: 'ev'

# the per-EV metadata: sample, biological source, cancer/normal condition
adata.obs[['sample', 'source', 'sample_type', 'condition']].head()

                     sample               source sample_type condition
ev_id
A549|TCCTGTTGCTAGTGT   A549  lung adenocarcinoma   cell_line    cancer
A549|ATCTAAAAAATACAG   A549  lung adenocarcinoma   cell_line    cancer
A549|GAGCGGTCACATCAA   A549  lung adenocarcinoma   cell_line    cancer
A549|ATAATATTTAGCTTA   A549  lung adenocarcinoma   cell_line    cancer
A549|AGCTAACCTTCGGCC   A549  lung adenocarcinoma   cell_line    cancer

# how many individual exosomes per sample, and the measurement value type
print(adata.obs['sample'].value_counts())
print()
print('value type :', adata.uns['ev']['value_type'])
print('assay      :', adata.uns['ev']['assay'])

sample
A549       5000
AGS        5000
BLC21      5000
Daudi      5000
HCT116     5000
HEK293     5000
K562       5000
MKN45      5000
MKN7       5000
MM1        5000
PC3        5000
SK-N-SH    5000
Serum-1    5000
Serum-2    5000
U87MG      5000
Name: count, dtype: int64

value type : count
assay      : Proximity Barcoding Assay

2. Quality control#

Single-EV QC targets artifacts that are specific to vesicle data, not cell data. ov.single.ev.qc removes EVs with too few detected proteins (membrane fragments, free antibody, background), removes or caps EVs with implausibly high total signal (doublets / barcode collisions where two vesicles are read as one tag) and drops proteins detected in too few EVs. Here we require at least 2 detected proteins per EV and keep proteins present in at least 0.5% of vesicles.

adata = ov.single.ev.qc(adata, min_proteins=2, min_ev_frac=0.005)
qc = adata.uns['ev']['qc']
print(f"EVs: {qc['n_ev_in']:,} -> {qc['n_ev_out']:,} "
      f"({qc['n_ev_removed']:,} removed)")
print(f"proteins: {qc['n_proteins_in']} -> {qc['n_proteins_out']}")
print(f"high-signal (doublet) cut: {qc['high_signal_cut']:.1f}")

EVs: 75,000 -> 46,839 (28,161 removed)
proteins: 40 -> 40
high-signal (doublet) cut: 19.8

About 28,000 low-information EVs (membrane fragments / barcodes with too few reads) are removed, leaving ~47,000 informative exosomes — a typical attrition for sparse single-EV sequencing data. The PBA panel is small (40 antibodies) so all 40 proteins are retained.
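
The three QC rules can be sketched in plain NumPy. This is an illustration under stated assumptions, not the code behind ov.single.ev.qc: the function name `qc_sketch`, the quantile-based high-signal cut (`high_quantile`) and the order of the filters are all assumptions, and the toy matrix is invented.

```python
import numpy as np

def qc_sketch(X, min_proteins=2, min_ev_frac=0.005, high_quantile=0.999):
    """Toy QC on an EV x protein count matrix; returns keep-masks and the cut.

    high_quantile is an assumed rule for the doublet cut, not the library's.
    """
    X = np.asarray(X)
    detected = X > 0
    n_proteins = detected.sum(axis=1)            # detected proteins per EV
    total = X.sum(axis=1)                        # total signal per EV
    high_cut = np.quantile(total, high_quantile)
    keep_ev = (n_proteins >= min_proteins) & (total <= high_cut)
    # fraction of the retained EVs in which each protein is detected
    frac = detected[keep_ev].mean(axis=0)
    keep_protein = frac >= min_ev_frac
    return keep_ev, keep_protein, high_cut

# Invented counts: a 1-protein EV, two ordinary EVs, a doublet-like EV, an empty tag
X_toy = np.array([
    [1, 0, 0],
    [2, 1, 0],
    [0, 3, 1],
    [50, 40, 0],
    [0, 0, 0],
])
keep_ev, keep_protein, high_cut = qc_sketch(X_toy, high_quantile=0.75)
print(keep_ev, keep_protein, high_cut)
```

On the toy matrix the single-protein row and the empty row fail `min_proteins`, the 90-count row exceeds the high-signal cut, and every protein is kept.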

A MISEV2023 purity assessment quantifies co-isolated non-vesicular contaminants — lipoproteins (ApoA1/ApoB), albumin, organelle proteins. contaminant_score writes per-EV scores and a preparation-level summary.

adata = ov.single.ev.contaminant_score(adata)
contam = adata.uns['ev']['contaminant']
print('preparation purity :', round(contam['purity'], 3))
print('contaminant markers found :', contam['markers_found'])

preparation purity : 1.0
contaminant markers found : {'lipoprotein': [], 'albumin': [], 'organelle': []}

The purity score is 1.0 and no contaminant markers were found. This follows from the panel rather than the sample: PBA was deliberately designed from tetraspanins, integrins and other genuine EV surface proteins, so it contains no lipoprotein/albumin/organelle targets to begin with. Purity here therefore reflects panel design, not a contaminant-free preparation per se.

Before normalizing, we take a MISEV-style snapshot of the raw counts with ev_summary — once normalize overwrites X, the raw per-EV totals are no longer in X (they remain in layers['counts']).

ov.single.ev.ev_summary(adata, cluster_key=None)

   n_evs  n_proteins  n_subpopulations  n_samples value_type platform  mean_proteins_per_ev  median_total_signal  qc_pass_rate
0  46839          40                 0         15      count  unknown              3.368368                  5.0           1.0

3. Source x protein abundance — reproducing PBA Fig. 4a#

The original PBA study (Wu et al., Nat Commun 2019, Fig. 4a) opens its biological analysis with a source x protein abundance heatmap: the per-EV molecule counts are aggregated to the level of each source (the 13 cell lines plus 2 human-serum samples) and shown as log(moleculeTag+1), with rows as sources and columns as antibody targets. It is the pseudobulk view of single-EV data: collapse the individual vesicles back to one profile per source and ask which source carries which surface protein.

We reproduce it directly. ov.single.ev.pseudobulk aggregates the QC’d per-EV matrix to a 15-source x 40-protein matrix; we log1p it and draw the sample x protein heatmap with the generic ov.pl.group_heatmap. This is the bulk-level summary that precedes the single-vesicle analysis — the rest of the notebook then goes beyond it.

# PBA Fig. 4a: aggregate per-EV counts to a source x protein matrix
pb_src = ov.single.ev.pseudobulk(adata, sample_key='sample', mode='mean')
pb_src.X = ov.np.log1p(pb_src.X)            # log(moleculeTag+1), as in Fig. 4a
pb_src.obs['sample'] = pb_src.obs.index.astype('category')
ov.pl.group_heatmap(pb_src, var_names=list(pb_src.var_names), groupby='sample',
                    standard_scale='var', cmap='magma', figsize=(9, 5),
                    label='log(tag+1)')
plt.show()

../_images/1cb1e4828396ac27bb1e26905aa6bd1d27a22572270d1de272e3acf40b3a0dfa.png
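
For readers who want to see what the aggregation amounts to, the mean-mode pseudobulk followed by log1p can be sketched in a few lines of NumPy. `pseudobulk_mean` is a hypothetical stand-in written for this illustration, and the counts are invented.

```python
import numpy as np

def pseudobulk_mean(X, groups):
    """Average an EV x protein matrix within each group label."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)                        # sorted unique labels
    means = np.vstack([X[groups == g].mean(axis=0) for g in labels])
    return labels, means

# Four invented EVs from two samples, two proteins
X_toy = np.array([[1, 0], [3, 0], [0, 2], [0, 6]])
groups = ["A549", "A549", "Serum-1", "Serum-1"]

labels, means = pseudobulk_mean(X_toy, groups)
log_means = np.log1p(means)                           # same transform as the heatmap
print(labels)
print(log_means.round(3))
```

Each output row is one sample and each column one protein, which is the shape the heatmap consumes.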

4. Normalization#

The normalization step is the one place the single-cell stack must branch on the assay’s value type, so ov.single.ev.normalize is EV-specific. PBA produces sequencing counts, so the right transform is the centered log-ratio (CLR), the same transform CITE-seq uses for antibody-derived tags. CLR removes the per-EV composition/depth effect by dividing each value by the per-vesicle geometric mean and taking the log. method='auto' reads uns['ev']['value_type'] and picks CLR automatically for count data. We normalize the full QC’d set here (CLR is a per-vesicle transform) so that the stratified embedding in Section 5 and every downstream subset share one consistent normalization.

# normalize the full QC'd set (CLR is a per-vesicle transform)
ov.single.ev.normalize(adata, method='auto')
print('normalization method :', adata.uns['ev']['normalize']['method'])
print('value type           :', adata.uns['ev']['value_type'])

normalization method : clr
value type           : count
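
The CLR transform itself is short enough to write out. The sketch below is the textbook form with a pseudocount and is not necessarily the exact variant ov.single.ev.normalize applies; the function name `clr`, the `pseudocount` argument and the toy counts are assumptions made for illustration.

```python
import numpy as np

def clr(X, pseudocount=1.0):
    """Centered log-ratio per row (per EV): log(x) minus the row mean of log(x)."""
    X = np.asarray(X, dtype=float)
    logx = np.log(X + pseudocount)
    # subtracting the mean log equals dividing by the geometric mean before the log
    return logx - logx.mean(axis=1, keepdims=True)

counts = np.array([[1.0, 2.0, 4.0],
                   [10.0, 20.0, 40.0]])   # second EV: same composition, 10x depth

no_pseudo = clr(counts, pseudocount=0.0)  # exact CLR, valid only without zeros
with_pseudo = clr(counts)                 # pseudocount keeps zero counts finite
print(no_pseudo.round(3))
print(with_pseudo.sum(axis=1).round(6))
```

Without the pseudocount the two rows come out identical, which is the depth effect being removed. With it, each row still sums to zero, but depth invariance is only approximate.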

5. Stratified embedding — reproducing PBA Fig. 4b#

PBA data is exceptionally sparse: most vesicles carry just 1-2 detected proteins. EVs with so few detected proteins share near-identical presence/absence patterns and cannot be resolved by source.
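
A quick count shows why so few detected proteins carry so little information. With a 40-antibody panel, the number of distinct presence/absence patterns available to an EV with exactly k detected proteins is the binomial coefficient C(40, k):

```python
from math import comb

n_panel = 40  # antibodies in the tutorial PBA panel

# distinct presence/absence patterns for an EV with exactly k detected proteins
patterns = {k: comb(n_panel, k) for k in (1, 2, 3)}
for k, n in patterns.items():
    print(f"{k} detected protein(s): {n} possible patterns")
```

Tens of thousands of 1-protein EVs must therefore share at most 40 patterns, whereas EVs with 3 detected proteins can already occupy thousands.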

The original PBA study (Wu et al., Nat Commun 2019, Fig. 4b) addressed this with three t-SNE panels side by side — exosomes with 1, 2, and >=3 detected proteins — each coloured by source. The finding: 1-protein exosomes give no good distinction between sources (a radial “firework” sparsity artifact), 2-protein exosomes are still ambiguous, and only the >=3-protein subset resolves same-source exosomes into coherent regions. The paper used t-SNE and did not attempt formal clustering on the sparse data.

We reproduce all three panels faithfully. The Section-2 QC kept only EVs with >=2 detected proteins, so to recover the 1-protein stratum we apply a permissive QC (min_proteins=1) to a fresh copy of the data, normalize it, and split on obs['n_proteins'] into the 1 / 2 / >=3 strata. Each stratum is z-scored (ov.pp.scale), reduced (ov.pp.pca, small panel so a low n_pcs) and embedded with t-SNE (ov.pp.tsne, as in the paper). The three embeddings are then drawn together, coloured by source. This section both reproduces Fig. 4b and motivates running the rest of the pipeline on the informative >=3-protein subset of the standard (min_proteins=2) QC’d data.

# permissive QC on a fresh copy recovers the 1-protein EVs for Fig. 4b
viz = ov.single.ev.qc(ov.datasets.ev_pba(), min_proteins=1, min_ev_frac=0.005)
ov.single.ev.normalize(viz, method='auto')
ev1 = viz[viz.obs['n_proteins'] == 1].copy()
ev2 = viz[viz.obs['n_proteins'] == 2].copy()
ev3 = viz[viz.obs['n_proteins'] >= 3].copy()
print(f"1-protein: {ev1.n_obs:,}   2-protein: {ev2.n_obs:,}   "
      f">=3-protein: {ev3.n_obs:,}  of {viz.n_obs:,} EVs")

🔍 Downloading data to ./data/ev_pba.h5ad
⚠️ File ./data/ev_pba.h5ad already exists
1-protein: 21,776   2-protein: 17,781   >=3-protein: 29,058  of 68,615 EVs
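
The scale and PCA steps applied to each stratum below can be sketched in NumPy to show what they compute. This assumes that scaling means a per-protein z-score followed by clipping at `max_value`. The symmetric clip and the helper names `zscore_clip` and `pca_scores` are assumptions, not the omicverse implementation, and the counts are simulated.

```python
import numpy as np

def zscore_clip(X, max_value=10.0):
    """Z-score each protein (column) and clip extreme values."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                      # constant columns stay at zero
    return np.clip((X - mu) / sd, -max_value, max_value)

def pca_scores(Z, n_pcs):
    """PCA by SVD of the centred matrix; returns scores and variance ratios."""
    Zc = Z - Z.mean(axis=0)
    U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
    var_ratio = S ** 2 / np.sum(S ** 2)
    return U[:, :n_pcs] * S[:n_pcs], var_ratio[:n_pcs]

rng = np.random.default_rng(0)
X_toy = rng.poisson(2.0, size=(200, 5)).astype(float)   # simulated counts
Z = zscore_clip(X_toy)
scores, var_ratio = pca_scores(Z, n_pcs=3)               # n_pcs below the panel size
print(scores.shape, var_ratio.round(3))
```

The number of components cannot exceed the number of proteins, which is why n_pcs has to stay below the panel size on a small panel.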

# Fig. 4b panel 1: embed the 1-detected-protein stratum (scale -> PCA -> t-SNE)
ov.pp.scale(ev1, max_value=10, layers_add='scaled')
ov.pp.pca(ev1, layer='scaled', n_pcs=30)
ov.pp.tsne(ev1, use_rep='scaled|original|X_pca', n_pcs=30)

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 0.0331s                                                 │
│  Shape:    21,776 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ _ov_provenance                                       │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ scaled (array, 21776x40)                             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ndarray, shape: (21776, 40), dtype: float64
   🔧 PCA solver used: covariance_eigh
    finished✅ (0.04s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 0.0439s                                                 │
│  Shape:    21,776 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Fa...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 21776x30)                              │
│           │ ✚ scaled|original|X_pca (array, 21776x30)              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using sklearn CPU t-SNE...
╭─ SUMMARY: tsne ────────────────────────────────────────────────────╮
│  Duration: 41.2834s                                                │
│  Shape:    21,776 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ tsne                                                 │
│           │ └─ params: {'n_components': 2, 'perplexity': 30, 'early_ex...│
│                                                                    │
│  ● OBSM   │ ✚ X_tsne (array, 21776x2)                              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

# Fig. 4b panel 2: embed the 2-detected-protein stratum
ov.pp.scale(ev2, max_value=10, layers_add='scaled')
ov.pp.pca(ev2, layer='scaled', n_pcs=30)
ov.pp.tsne(ev2, use_rep='scaled|original|X_pca', n_pcs=30)

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 0.0135s                                                 │
│  Shape:    17,781 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ _ov_provenance                                       │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ scaled (array, 17781x40)                             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ndarray, shape: (17781, 40), dtype: float64
   🔧 PCA solver used: covariance_eigh
    finished✅ (0.03s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 0.0309s                                                 │
│  Shape:    17,781 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Fa...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 17781x30)                              │
│           │ ✚ scaled|original|X_pca (array, 17781x30)              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using sklearn CPU t-SNE...
╭─ SUMMARY: tsne ────────────────────────────────────────────────────╮
│  Duration: 59.0307s                                                │
│  Shape:    17,781 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ tsne                                                 │
│           │ └─ params: {'n_components': 2, 'perplexity': 30, 'early_ex...│
│                                                                    │
│  ● OBSM   │ ✚ X_tsne (array, 17781x2)                              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

# Fig. 4b panel 3: embed the informative >=3-detected-protein stratum
ov.pp.scale(ev3, max_value=10, layers_add='scaled')
ov.pp.pca(ev3, layer='scaled', n_pcs=30)
ov.pp.tsne(ev3, use_rep='scaled|original|X_pca', n_pcs=30)
print('Fig. 4b panels embedded:', ev1.n_obs, ev2.n_obs, ev3.n_obs, 'EVs')

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 0.0248s                                                 │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ _ov_provenance                                       │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ scaled (array, 29058x40)                             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ndarray, shape: (29058, 40), dtype: float64
   🔧 PCA solver used: covariance_eigh
    finished✅ (0.03s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 0.0327s                                                 │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Fa...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 29058x30)                              │
│           │ ✚ scaled|original|X_pca (array, 29058x30)              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🖥️ Using sklearn CPU t-SNE...
╭─ SUMMARY: tsne ────────────────────────────────────────────────────╮
│  Duration: 85.7596s                                                │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ tsne                                                 │
│           │ └─ params: {'n_components': 2, 'perplexity': 30, 'early_ex...│
│                                                                    │
│  ● OBSM   │ ✚ X_tsne (array, 29058x2)                              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
Fig. 4b panels embedded: 21776 17781 29058 EVs

# PBA Fig. 4b: three t-SNE panels (1 / 2 / >=3 proteins), coloured by source
fig, axes = plt.subplots(1, 3, figsize=(16, 4.6))
ov.pl.embedding(ev1, basis='X_tsne', color='source', ax=axes[0], show=False,
                frameon='small', legend_loc=None, title='1 protein / EV')
ov.pl.embedding(ev2, basis='X_tsne', color='source', ax=axes[1], show=False,
                frameon='small', legend_loc=None, title='2 proteins / EV')
ov.pl.embedding(ev3, basis='X_tsne', color='source', ax=axes[2], show=False,
                frameon='small', title='>=3 proteins / EV')
plt.tight_layout()
plt.show()

../_images/2528e0077ca13a79e8ded5e2a700c58a0b4675d444d1090be4b7ceacf1cfa0bc.png

As in the paper’s Fig. 4b, the 1-protein panel is a sparsity artifact — EVs radiate into a radial “firework” with no source separation, because a single detected protein cannot distinguish one source from another. The 2-protein panel is still largely ambiguous. Only the >=3-protein panel resolves: same-source exosomes fall into coherent regions of the embedding. This is the empirical justification for analysing the informative >=3-protein subset — we now take that subset of the standard (min_proteins=2) QC’d, normalized adata as the working object, z-score it and run PCA, then continue with clustering, marker discovery and differential analysis.

# adopt the informative >=3-protein subset of the standard-QC'd data
adata_full = adata
adata = adata_full[adata_full.obs['n_proteins'] >= 3].copy()
ov.pp.scale(adata, max_value=10, layers_add='scaled')
ov.pp.pca(adata, layer='scaled', n_pcs=30)
print(f"working set (>=3 proteins): {adata.n_obs:,} of {adata_full.n_obs:,} EVs")

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 0.0151s                                                 │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ _ov_provenance                                       │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ scaled (array, 29058x40)                             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
computing PCA🔍
    with n_comps=30
   🖥️ Using sklearn PCA for CPU computation
   🖥️ sklearn PCA backend: CPU computation
   📊 PCA input data type: ndarray, shape: (29058, 40), dtype: float64
   🔧 PCA solver used: covariance_eigh
    finished✅ (0.03s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 0.0295s                                                 │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Fa...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 29058x30)                              │
│           │ ✚ scaled|original|X_pca (array, 29058x30)              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
working set (>=3 proteins): 29,058 of 46,839 EVs

6. The EV neighbor graph#

The working subset was z-scored and reduced with PCA in the previous cell (scaling and PCA are generic single-cell preprocessing — an EV x protein matrix is structurally a cell x gene matrix). All that remains before clustering is the k-nearest-neighbor graph over the EVs, built with the omicverse-native ov.pp.neighbors. Protein panels are small, so there is no highly-variable-gene step — every protein is informative and kept — and n_pcs stays capped below the 40-protein panel size.

# the kNN graph is a generic single-cell step -> omicverse-native ov.pp
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=30,
                use_rep='scaled|original|X_pca')
print('PCA components :', adata.obsm['scaled|original|X_pca'].shape[1])
print('variance explained by PC1-3 :',
      adata.uns['pca']['variance_ratio'][:3].round(3))

🖥️ Using Scanpy CPU to calculate neighbors...

🔍 K-Nearest Neighbors Graph Construction:
   Mode: cpu
   Neighbors: 15
   Method: umap
   Metric: euclidean
   Representation: scaled|original|X_pca
   PCs used: 30
   🔍 Computing neighbor distances...
🔍 Computing connectivity matrix...
   💡 Using UMAP-style connectivity
✓ Graph is fully connected

✅ KNN Graph Construction Completed Successfully!
   ✓ Processed: 29,058 cells with 15 neighbors each
   ✓ Results added to AnnData object:
     • 'neighbors': Neighbors metadata (adata.uns)
     • 'distances': Distance matrix (adata.obsp)
     • 'connectivities': Connectivity matrix (adata.obsp)

╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
│  Duration: 35.1067s                                                │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ neighbors                                            │
│           │ └─ params: {'n_neighbors': 15, 'method': 'umap', 'random_s...│
│                                                                    │
│  ● OBSP   │ ✚ connectivities (sparse matrix, 29058x29058)          │
│           │ ✚ distances (sparse matrix, 29058x29058)               │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
PCA components : 30
variance explained by PC1-3 : [0.077 0.044 0.04 ]
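
Conceptually, the graph built above links each EV to its closest EVs in PCA space. A brute-force sketch of that idea follows. It is for illustration only: neighbour searches at this scale normally use approximate methods, `knn_indices` is a hypothetical helper, and the four points are invented.

```python
import numpy as np

def knn_indices(P, k):
    """Brute-force k nearest neighbours (Euclidean) for each row of P."""
    P = np.asarray(P, dtype=float)
    sq = (P ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * P @ P.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                      # an EV is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

# Four invented EVs in a 2-D "PCA" space: two tight pairs far apart
P_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
nn = knn_indices(P_toy, k=1)
print(nn.ravel())
```

The dense distance matrix needs memory quadratic in the number of EVs, so this form is practical only for small examples.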

7. EV-subpopulation clustering#

EV-subpopulation discovery is the heart of single-EV analysis. We use FlowSOM — the cytometry-standard clustering for marker-panel data — on the informative ≥3-protein subset. A self-organizing map is trained on the EV x protein matrix, then the SOM nodes are hierarchically metaclustered into the requested number of vesicle subpopulations. omicverse ships a native pure-Python FlowSOM, so there is no R/Java dependency.

A caveat worth stating up front: this is a targeted 40-plex panel and the data, even after the informative-subset filter, is still sparse. The FlowSOM marker heatmap / dotplot (Section 10) is therefore the primary structural readout — it is what defines and validates each subpopulation by its surface-protein program. The 2-D embedding below is a coarse visual aid, not a source of crisp, well-separated clusters.
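
The two FlowSOM stages described above, SOM training followed by metaclustering of the nodes, can be sketched in NumPy. This is a teaching sketch under stated assumptions, not the omicverse FlowSOM: the decay schedules are arbitrary, a plain average-linkage merge stands in for whatever metaclustering the library uses, all function names are hypothetical and the data are simulated.

```python
import numpy as np

def train_som(X, grid=(3, 3), n_epochs=5, seed=0):
    """Train a small rectangular SOM; returns node weights (n_nodes x n_features)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = X[rng.choice(len(X), size=rows * cols, replace=True)].copy()  # init from data
    n_steps, step = n_epochs * len(X), 0
    sigma0 = max(rows, cols) / 2.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = 0.5 * (1.0 - frac)                      # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # radius shrinks to 0.5
            bmu = np.argmin(((W - X[i]) ** 2).sum(axis=1))   # best-matching unit
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            W += lr * h[:, None] * (X[i] - W)            # pull neighbourhood toward X[i]
            step += 1
    return W

def assign_nodes(X, W):
    """Index of the nearest SOM node for every row of X."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def metacluster(W, n_clusters):
    """Average-linkage agglomerative merge of SOM nodes into n_clusters groups."""
    clusters = [[i] for i in range(len(W))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(W[i] - W[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)          # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = np.empty(len(W), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(100, 3)),
                   rng.normal(10.0, 0.3, size=(100, 3))])   # two simulated populations
W = train_som(X_toy, grid=(3, 3), n_epochs=5)
node_labels = metacluster(W, n_clusters=2)
ev_labels = node_labels[assign_nodes(X_toy, W)]              # EV -> node -> metacluster
print(np.bincount(ev_labels))
```

Each EV inherits the metacluster of its nearest node, so the number of final subpopulations is set by `n_clusters` rather than by the grid size.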

ov.single.ev.flowsom(adata, n_clusters=8, grid=(10, 10), n_epochs=20)
print(adata.obs['flowsom'].value_counts().sort_index())

flowsom
  4900
  5214
  5186
  3341
  3340
  1970
  1855
  3252
Name: count, dtype: int64

FlowSOM partitions the informative subset into 8 EV subpopulations ranging from roughly 1,900 to 5,200 vesicles each. We also run Leiden graph-community detection, the single-cell standard, for comparison.

# graph-community detection is generic single-cell -> omicverse-native ov.pp.leiden
ov.pp.leiden(adata, resolution=0.3, key_added='leiden')
print('Leiden subpopulations :', adata.obs['leiden'].nunique())

🖥️ Using Scanpy CPU Leiden...
running Leiden clustering
finished (2.40s)
    found 28 clusters and added
    'leiden', the cluster labels (adata.obs, categorical)

╭─ SUMMARY: leiden ──────────────────────────────────────────────────╮
│  Duration: 2.4386s                                                 │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● OBS    │ ✚ leiden (category)                                    │
│                                                                    │
│  ● UNS    │ ✚ leiden                                               │
│           │ └─ params: {'resolution': 0.3, 'random_state': 0, 'n_itera...│
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
Leiden subpopulations : 28

Leiden returns far more clusters than FlowSOM (28 versus 8). This is expected: even on the ≥3-protein subset, PBA data is sparse and targeted, so the EV kNN graph fragments into many small communities. For sparse marker-panel single-EV data FlowSOM is the more robust choice, and we use the FlowSOM labels as the EV subpopulations for the rest of the notebook. Graph-community detection (Leiden) is itself generic, so we use the omicverse-native ov.pp.leiden.
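
One way to inspect how a fine partition nests inside a coarse one is a contingency table of the two label vectors. The sketch below uses invented labels and a hypothetical `contingency` helper. On the real object the inputs would be adata.obs['flowsom'] and adata.obs['leiden'].

```python
import numpy as np

def contingency(labels_a, labels_b):
    """Cross-tabulate two label vectors of equal length."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)      # count each (a, b) label pair
    return a_vals, b_vals, table

# Invented labels: 2 coarse clusters versus 5 fine communities
coarse = [0, 0, 0, 1, 1, 1]
fine = ["a", "b", "c", "d", "d", "e"]
rows, cols, table = contingency(coarse, fine)
print(table)
```

Columns with a single non-zero entry are fine communities contained entirely within one coarse cluster, the pattern to expect when a graph method over-splits sparse data.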

8. UMAP embedding#

A UMAP gives a 2-D view of the subset. Read it as a coarse aid, not as crisp subpopulations. With a sparse 40-plex panel the embedding will look modest — the FlowSOM marker heatmap/dotplot in Section 10 is the primary, quantitative description of the subpopulation structure. The UMAP here simply shows that the FlowSOM labels occupy coherent regions and that a tetraspanin signal varies smoothly across the embedding. The embedding and its scatter are generic single-cell steps, so we use ov.pp.umap and ov.pl.embedding.

# UMAP + the embedding scatter are generic single-cell -> omicverse-native ov.pp / ov.pl
ov.pp.umap(adata)
ov.pl.embedding(adata, basis='X_umap', color='flowsom', frameon='small',
                title='EV subpopulations (FlowSOM)')
plt.show()

🔍 [2026-05-21 20:55:03] Running UMAP in 'cpu' mode...
🖥️ Using Scanpy CPU UMAP...

🔍 UMAP Dimensionality Reduction:
   Mode: cpu
   Method: umap
   Components: 2
   Min distance: 0.5
{'n_neighbors': 15, 'method': 'umap', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'scaled|original|X_pca', 'n_pcs': 30}
   🔍 Computing UMAP parameters...
   🔍 Computing UMAP embedding (classic method)...
✅ UMAP Dimensionality Reduction Completed Successfully!
   ✓ Embedding shape: 29,058 cells × 2 dimensions
   ✓ Results added to AnnData object:
     • 'X_umap': UMAP coordinates (adata.obsm)
     • 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.

╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
│  Duration: 20.6621s                                                │
│  Shape:    29,058 x 40 (Unchanged)                                 │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ umap                                                 │
│           │ └─ params: {'a': np.float64(0.5830300203414425), 'b': np.f...│
│                                                                    │
│  ● OBSM   │ ✚ X_umap (array, 29058x2)                              │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

../_images/f697f07d6c53d14c582f2b613c3bc941bf6cf260f621ebaafab356e6926ef8c8.png

# the same embedding coloured by a tetraspanin marker's signal
ov.pl.embedding(adata, basis='X_umap', color='CD63_C', cmap='magma',
                frameon='small', title='CD63 signal per EV')
plt.show()

../_images/cc48db17ea1455526d11acdcd4942172d92a21606d8b6b95b50f0fc67a2e7026.png

9. MISEV2023 marker classification#

classify_markers labels every protein in var with its MISEV2023 category — transmembrane/lipid-bound EV markers, cytosolic EV markers, co-isolated contaminants, organelle contaminants, or functional/cell-type/disease markers. It resolves antibody-barcode suffixes (CD9_A, CD63_C) and CD-antigen shorthand (CD107a → LAMP-1), so the panel’s genuine EV markers are recognised rather than dropped into 'other'.

ov.single.ev.classify_markers(adata)
print(adata.var['misev_category'].value_counts())

misev_category
other            25
transmembrane    10
functional        5
Name: count, dtype: int64

Of the 40 PBA proteins, 10 are recognised as transmembrane/lipid-bound EV markers — the tetraspanins (CD9, CD63, CD151), CD107a/LAMP-1, ADAM10 and integrin transmembrane subunits — and 5 as functional/cell-type markers (EGFR, EpCAM and other signalling/adhesion proteins). The remaining 25 fall in 'other': integrin subunits and CD-antigens not in the core MISEV panels. This reflects the PBA panel’s design focus on surface adhesion proteins, while still confirming a solid core of bona fide EV markers is present.
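The name resolution that classify_markers performs can be pictured with a small sketch. The helper resolve_marker and the alias table are hypothetical illustrations, not the omicverse implementation; only the two behaviours named in the text (stripping barcode suffixes such as _A/_C, and mapping CD107a to LAMP-1) come from the notebook.

```python
import re

# Hypothetical alias table: CD-antigen shorthand -> canonical protein name.
# Only the CD107a entry is taken from the notebook text.
CD_ALIASES = {"CD107a": "LAMP-1"}


def resolve_marker(name):
    """Strip a single-letter antibody-barcode suffix, then apply CD aliases."""
    base = re.sub(r"_[A-Z]$", "", name)  # CD9_A -> CD9, CD63_C -> CD63
    return CD_ALIASES.get(base, base)


for raw in ["CD9_A", "CD63_C", "CD107a", "EpCAM"]:
    print(raw, "->", resolve_marker(raw))
```

Resolving names before the category lookup is what keeps barcode-suffixed tetraspanins out of the 'other' class.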

Tetraspanin EV subtypes#

annotate_ev_subtype assigns each vesicle to a tetraspanin-defined surface subset from CD9/CD63/CD81 positivity — single-, double-, triple-positive or tetraspanin-negative. MISEV2023 stresses tetraspanins are not universal EV markers, so the negative class is kept explicit rather than discarded. The PBA panel carries two CD9 and two CD63 antibody barcodes plus the tetraspanin CD151; we use one of each. The classifier resolves the _A/_C barcode suffixes to the underlying tetraspanin identity.

ov.single.ev.annotate_ev_subtype(
    adata, tetraspanins=['CD9_A', 'CD63_C', 'CD151'])
print(adata.obs['ev_subtype'].value_counts())

ev_subtype
tetraspanin-negative              13389
CD151-only                         6106
CD63_C-only                        3953
double-positive (CD63_C/CD151)     2264
CD9_A-only                         1733
double-positive (CD9_A/CD151)       686
double-positive (CD9_A/CD63_C)      641
triple-positive                     286
Name: count, dtype: int64

The counts show a large tetraspanin-negative fraction (13,389 of 29,058 EVs, about 46%) alongside substantial single-positive subsets (CD151-only is the largest, followed by CD63-only and CD9-only), a set of smaller double-positive subsets, and only a small triple-positive core (286 EVs, about 1%). This is concrete single-vesicle confirmation of the MISEV2023 point that tetraspanins mark only a fraction of EVs, and that CD9, CD63 and CD151 largely label distinct vesicle subsets rather than all co-occurring.
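As a minimal sketch of the labelling logic, the function below turns one vesicle's positivity calls into a subtype label and then recomputes the tetraspanin-negative fraction from the counts printed above. ev_subtype is a hypothetical helper written for exactly three markers; how annotate_ev_subtype thresholds positivity is not shown in this notebook and is not reproduced here.

```python
def ev_subtype(positive):
    """Label one EV from its per-marker positivity calls (marker -> bool).

    Assumes exactly three tetraspanin markers are supplied.
    """
    hits = [marker for marker, is_pos in positive.items() if is_pos]
    if not hits:
        return "tetraspanin-negative"
    if len(hits) == 1:
        return f"{hits[0]}-only"
    if len(hits) == 2:
        return f"double-positive ({hits[0]}/{hits[1]})"
    return "triple-positive"


example = {"CD9_A": False, "CD63_C": True, "CD151": True}
print(ev_subtype(example))

# Subtype counts copied from the value_counts() output above.
counts = {
    "tetraspanin-negative": 13389,
    "CD151-only": 6106,
    "CD63_C-only": 3953,
    "double-positive (CD63_C/CD151)": 2264,
    "CD9_A-only": 1733,
    "double-positive (CD9_A/CD151)": 686,
    "double-positive (CD9_A/CD63_C)": 641,
    "triple-positive": 286,
}
total = sum(counts.values())
negative_fraction = counts["tetraspanin-negative"] / total
print(total, round(negative_fraction, 3))
```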

10. Per-subpopulation marker proteins#

rank_markers identifies, for each EV subpopulation, the proteins enriched in that subpopulation versus all other EVs (Wilcoxon rank-sum, with effect size, log fold-change and BH-FDR). Together with the heatmap and dotplot below, this is the primary, quantitative description of the FlowSOM subpopulations.

markers = ov.single.ev.rank_markers(adata, groupby='flowsom', n_top=3)
markers[['group', 'protein', 'effect_size', 'log2fc', 'frac_in', 'padj']]

	group	protein	effect_size	log2fc	frac_in	padj
0	0	ITGA3	0.776338	NaN	0.567551	0.000000e+00
1	0	ITGB1	0.729780	NaN	0.505510	1.294781e-292
2	0	CD151	0.601613	2.051215	0.638571	0.000000e+00
3	1	CD13	0.938118	NaN	0.609705	0.000000e+00
4	1	ITGA9	0.776320	NaN	0.490986	0.000000e+00
5	1	CD107a	0.397167	2.304287	0.462601	0.000000e+00
6	2	CD166	0.793371	3.508573	0.631894	0.000000e+00
7	2	EpCAM	0.759111	3.339216	0.643656	0.000000e+00
8	2	ITGA6	0.575365	NaN	0.444466	6.241038e-83
9	3	ITGAL	0.683997	NaN	0.363963	0.000000e+00
10	3	ITGB2	0.576198	NaN	0.282251	0.000000e+00
11	3	CD90	0.390623	NaN	0.189165	0.000000e+00
12	4	CD63_D	0.681949	8.026542	0.515868	0.000000e+00
13	4	CD63_C	0.669449	NaN	0.445210	0.000000e+00
14	4	CD151	0.348081	1.206982	0.552695	4.174578e-199
15	5	CD166	0.905338	2.621310	0.811168	0.000000e+00
16	5	CD107a	0.812671	3.549264	0.634518	0.000000e+00
17	5	EGFR	0.341766	1.833653	0.437563	2.092213e-89
18	6	EpCAM	0.713413	2.331228	0.709973	2.010303e-257
19	6	CD151	0.528610	1.586918	0.666307	4.566145e-151
20	6	Del1	0.380142	3.628360	0.323450	8.681544e-27
21	7	ITGB5	0.440698	NaN	0.148831	0.000000e+00
22	7	ITGA8	0.400238	NaN	0.112239	0.000000e+00
23	7	ITGAE	0.387738	NaN	0.141759	0.000000e+00
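The rank-sum idea behind this table can be illustrated on toy numbers. The sketch computes the Mann-Whitney AUC, the probability that a random in-group EV has a higher signal than a random out-group EV, which is the quantity the Wilcoxon rank-sum test is built on. It is a generic illustration: the notebook does not state how the effect_size column is defined, so mann_whitney_auc and the toy values are assumptions, not the rank_markers implementation.

```python
def mann_whitney_auc(in_group, rest):
    """P(in-group value > out-group value), counting ties as one half."""
    wins = 0.0
    for x in in_group:
        for y in rest:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(in_group) * len(rest))


# Toy per-EV signals for one protein (not taken from the PBA data).
in_cluster = [3, 2, 2, 0]
other_evs = [0, 0, 1, 2]

auc = mann_whitney_auc(in_cluster, other_evs)
rank_biserial = 2 * auc - 1  # rescales AUC to the range [-1, 1]
print(auc, rank_biserial)
```

An AUC of 0.5 means no separation between the subpopulation and the remaining EVs; values near 1 mean the protein is consistently higher inside the subpopulation.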

# dot plot: dot colour = mean signal, dot size = fraction of EVs positive
top_proteins = list(dict.fromkeys(
    ov.single.ev.rank_markers(adata, groupby='flowsom', n_top=3)['protein']))
ov.pl.dotplot(adata, top_proteins, groupby='flowsom', use_raw=False,
              standard_scale='var', cmap='Reds')
plt.show()

../_images/8e7715430d0182abf78a2e573829365a898d3327730690f92de049fc00cc4dae.png

# protein x subpopulation mean-signal heatmap (per-subpopulation markers)
flowsom_markers = ov.single.ev.rank_markers(adata, groupby='flowsom', n_top=3)
# assign each protein to a single subpopulation for the heatmap row groups
seen, marker_dict = set(), {}
for grp, dfg in flowsom_markers.groupby('group'):
    marker_dict[grp] = [p for p in dfg['protein']
                        if not (p in seen or seen.add(p))]
marker_dict = {g: v for g, v in marker_dict.items() if v}
ov.pl.marker_heatmap(adata, marker_genes_dict=marker_dict, groupby='flowsom',
                     use_raw=False, standard_scale='var', figsize=(8, 5))
plt.show()

PyComplexHeatmap have been install version: 1.8.5
Starting..
Calculating row orders..
Reordering rows..
Calculating col orders..
Reordering cols..
Plotting matrix..
Inferred max_s (max size of scatter point) is: 524.8521583605069
Collecting legends..
Plotting legends..
Estimated legend width: 74.54333333333334 mm

../_images/c0f0a614ff56a803dc87f1680d21d5e73151f8aeda81bdc82be65b267490941f.png

The subpopulations carry distinct surface signatures — for example one is defined by the epithelial/tumor markers EpCAM and CD166, another by an integrin module (ITGA3 / ITGB1 / CD151). These are vesicle subpopulations, each with its own protein program.

EV-cargo enrichment#

marker_enrichment tests whether a subpopulation’s marker proteins are over-represented in a curated EV-cargo reference (ExoCarta + Vesiclepedia), using the hypergeometric distribution.

ref = ov.datasets.ev_marker_reference()
sub1_markers = ov.single.ev.rank_markers(
    adata, groupby='flowsom', n_top=10)
sub1_markers = list(sub1_markers[sub1_markers['group'] == '1']['protein'])
enr = ov.single.ev.marker_enrichment(
    adata, markers=sub1_markers,
    reference={'Vesiclepedia/ExoCarta': ref['gene_symbol'].tolist()})
enr

🔍 Downloading data to ./data/ev_marker_reference.tsv.gz
⚠️ File ./data/ev_marker_reference.tsv.gz already exists

	reference	n_markers	n_reference	n_overlap	expected	fold_enrichment	pvalue	padj
0	Vesiclepedia/ExoCarta	10	26	7	6.5	1.076923	0.508059	0.508059

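26 of the 40 assayed PBA proteins (65%) are documented EV cargo in Vesiclepedia/ExoCarta, so most of the panel consists of known EV proteins, as intended for an EV-surface assay. That is also why the test is not significant. The background is the 40-protein panel itself (expected overlap 10 × 26/40 = 6.5), so 7 of 10 subpopulation-1 markers in the reference is a fold enrichment of only 1.08 (p = 0.51). The result says these markers are as EV-typical as the rest of the panel, not more so.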
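The enrichment row above can be recomputed by hand, which also makes the choice of background explicit. This is a minimal sketch using only the four counts in the table and a background of 40 panel proteins; hypergeom_enrichment is a hypothetical helper, not the marker_enrichment source code.

```python
from math import comb


def hypergeom_enrichment(n_markers, n_reference, n_overlap, n_background):
    """Expected overlap, fold enrichment and upper-tail hypergeometric p-value."""
    expected = n_markers * n_reference / n_background
    fold = n_overlap / expected
    upper = min(n_markers, n_reference)
    tail = sum(
        comb(n_reference, k) * comb(n_background - n_reference, n_markers - k)
        for k in range(n_overlap, upper + 1)
    )
    pval = tail / comb(n_background, n_markers)
    return expected, fold, pval


# 10 markers, 26 reference proteins, 7 shared, 40 proteins in the panel.
expected, fold, pval = hypergeom_enrichment(10, 26, 7, 40)
print(expected, round(fold, 6), round(pval, 6))
```

With a background of 40 the sketch matches the tabulated expected, fold_enrichment and pvalue columns, which is why a panel that is already two-thirds EV cargo leaves little room for enrichment.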

11. Marker colocalization — the single-vesicle advantage#

This is the analysis bulk EV proteomics simply cannot do. Because each row is one physical vesicle, we can ask which markers co-occur on the same individual EV. colocalization computes, for every marker pair, the co-positive EV count, Jaccard index, odds ratio, observed/expected co-positivity and a BH-corrected Fisher’s-exact p-value.

coloc_markers = ['CD9_A', 'CD63_C', 'CD9_B', 'CD63_D', 'CD151', 'CD147']
coloc = ov.single.ev.colocalization(adata, markers=coloc_markers)
coloc[['markers', 'n_copos', 'jaccard', 'odds_ratio', 'obs_exp', 'padj']]

	markers	n_copos	jaccard	odds_ratio	obs_exp	padj
0	CD151+CD147	2127	0.177828	1.924497	1.394009	3.072239e-88
1	CD63_D+CD151	1843	0.146061	1.233433	1.119866	1.797236e-10
2	CD63_C+CD63_D	1545	0.216265	4.677923	2.456636	0.000000e+00
3	CD63_C+CD151	1351	0.116858	1.333241	1.177099	3.275218e-14
4	CD63_D+CD147	727	0.079558	0.820823	0.869535	9.999996e-01
5	CD9_B+CD151	639	0.059019	0.901771	0.935777	9.999996e-01
6	CD63_C+CD147	484	0.061798	0.781702	0.830071	9.999996e-01
7	CD9_B+CD63_D	427	0.062647	1.193729	1.141179	2.843672e-03
8	CD9_A+CD151	405	0.038965	0.804580	0.864612	9.999996e-01
9	CD9_B+CD147	310	0.047256	0.867796	0.893604	9.999996e-01
10	CD63_C+CD9_B	304	0.056401	1.211950	1.164975	4.402231e-03
11	CD9_A+CD63_D	280	0.044473	1.120447	1.090885	1.026759e-01
12	CD9_A+CD9_B	235	0.070233	2.621437	2.206579	5.211874e-31
13	CD9_A+CD147	201	0.033489	0.813438	0.844646	9.999996e-01
14	CD9_A+CD63_C	192	0.039710	1.090640	1.072603	2.552694e-01

ov.single.ev.colocalization_plot(coloc, value='obs_exp', cmap='magma')
plt.show()

../_images/e529a7465f1793ca5b02ccab9791d7cb06651adbd21ad56bd403e00a6eb54fec.png

The two independent CD63 antibody barcodes (CD63_C and CD63_D) co-localize on the same vesicles far above chance (odds ratio ~4.7, observed/expected ~2.5, adjusted p below floating-point precision and printed as 0). This is an internal positive control: two antibodies against the same protein should land on the same EV. The two CD9 barcodes behave the same way (CD9_A+CD9_B: odds ratio ~2.6, observed/expected ~2.2). CD151 and CD147 also co-occur above expectation (odds ratio ~1.9, observed/expected ~1.4), pointing to a genuine co-presence of these two surface proteins on a shared vesicle subset. A bulk measurement would only report that each of these proteins is “present”; single-EV resolution shows which travel together.
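The pairwise statistics in the table all derive from one 2 x 2 contingency table per marker pair. The sketch below computes them on toy binary calls. colocalization_stats is a hypothetical helper; the one-sided (enrichment) Fisher test and the absence of any continuity correction are assumptions, so values may differ slightly from what ov.single.ev.colocalization reports.

```python
from math import comb


def colocalization_stats(a_calls, b_calls):
    """Co-positivity statistics for two markers from per-EV binary calls."""
    n = len(a_calls)
    both = sum(1 for a, b in zip(a_calls, b_calls) if a and b)
    only_a = sum(1 for a, b in zip(a_calls, b_calls) if a and not b)
    only_b = sum(1 for a, b in zip(a_calls, b_calls) if b and not a)
    neither = n - both - only_a - only_b
    n_a, n_b = both + only_a, both + only_b

    jaccard = both / (both + only_a + only_b)
    # Undefined when a discordant cell is empty; report infinity in that case.
    odds_ratio = (both * neither) / (only_a * only_b) if only_a and only_b else float("inf")
    obs_exp = both / (n_a * n_b / n)  # observed / expected under independence
    # One-sided Fisher exact test: P(co-positives >= observed).
    pval = sum(
        comb(n_a, k) * comb(n - n_a, n_b - k)
        for k in range(both, min(n_a, n_b) + 1)
    ) / comb(n, n_b)
    return {"n_copos": both, "jaccard": jaccard, "odds_ratio": odds_ratio,
            "obs_exp": obs_exp, "pval": pval}


# Toy calls for 10 EVs (not PBA data).
marker_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
marker_b = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
stats = colocalization_stats(marker_a, marker_b)
print(stats)
```

An obs_exp above 1 with an odds ratio above 1 is the pattern seen for CD63_C+CD63_D; pairs with obs_exp below 1 co-occur less often than independent markers would.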

EV protein-signature combinations#

protein_combinations enumerates the exact marker signatures carried by individual EVs — the multi-marker combinations a vesicle is positive for.

combos = ov.single.ev.protein_combinations(
    adata, markers=['CD9_A', 'CD63_C', 'CD9_B', 'CD63_D'])
combos.head(8)

	combination	n_markers	n_ev	fraction
0	(none)	0	19495	0.670900
1	CD63_D	1	3105	0.106855
2	CD63_C	1	1759	0.060534
3	CD9_B	1	1358	0.046734
4	CD63_C+CD63_D	2	1353	0.046562
5	CD9_A	1	898	0.030904
6	CD9_B+CD63_D	2	265	0.009120
7	CD9_A+CD63_D	2	170	0.005850
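Enumerating signatures amounts to counting the distinct sets of positive markers across vesicles. The sketch below does this on six toy EVs; protein_signatures is a hypothetical helper that mirrors the columns of the table above, not the omicverse implementation.

```python
from collections import Counter


def protein_signatures(ev_calls, markers):
    """Count exact marker signatures; returns (combination, n_markers, n_ev, fraction)."""
    signatures = Counter()
    for calls in ev_calls:
        positive = [m for m in markers if calls.get(m, False)]
        signatures["+".join(positive) if positive else "(none)"] += 1
    n_total = len(ev_calls)
    return [
        (combo, 0 if combo == "(none)" else combo.count("+") + 1, n_ev, n_ev / n_total)
        for combo, n_ev in signatures.most_common()
    ]


# Toy positivity calls for six EVs (not PBA data).
panel = ["CD63_C", "CD63_D"]
toy_evs = [
    {"CD63_C": False, "CD63_D": False},
    {"CD63_C": False, "CD63_D": False},
    {"CD63_C": False, "CD63_D": False},
    {"CD63_C": False, "CD63_D": True},
    {"CD63_C": False, "CD63_D": True},
    {"CD63_C": True, "CD63_D": True},
]
table = protein_signatures(toy_evs, panel)
for row in table:
    print(row)
```

Each EV contributes to exactly one signature, so the fractions sum to 1; in the real table the '(none)' row holds 19,495 of 29,058 EVs (0.6709).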

Source-selective protein combinations — reproducing PBA Fig. 5a#

The original PBA study’s Fig. 5a is the centrepiece of its single-vesicle biology: it ranks multi-protein combinations — e.g. “CD151 & EpCAM”, “ADAM10 & CD166 & CD63” — by the number of exosomes carrying each, and singles out source-selective combinations: protein signatures that are far more frequent on the exosomes of one source than another. This is the differentially-expressed protein combination (DEPC) analysis, and it is exactly what bulk EV proteomics cannot do — a bulk assay reports the average abundance of each protein separately and is structurally blind to which proteins ride the same vesicle, let alone whether a combination is source-specific.

ov.single.ev.protein_combinations reproduces it. We restrict to two representative sources, K562 (chronic myeloid leukaemia exosomes) and Serum-1 (human-serum exosomes), and to a focused 8-marker surface panel (tetraspanins CD9/CD63/CD151, CD147, ADAM10, and the epithelial/signalling markers CD166, EpCAM, EGFR). With condition_key='sample' the function enumerates every protein combination carried by individual exosomes and runs a BH-corrected Fisher test per combination, returning a DEPC table ranked by the magnitude of the log2 fold-change between the two sources. Serum-1 is the reference, so positive values mean enrichment on K562 exosomes.

# PBA Fig. 5a: source-selective protein combinations (DEPCs), K562 vs Serum-1
panel = ['CD9_A', 'CD63_C', 'CD151', 'CD147', 'ADAM10', 'CD166', 'EpCAM', 'EGFR']
pair = adata[adata.obs['sample'].isin(['K562', 'Serum-1'])].copy()
depc = ov.single.ev.protein_combinations(
    pair, markers=panel, condition_key='sample', reference='Serum-1', min_ev=20)
depc[['combination', 'n_markers', 'n_K562', 'n_Serum-1',
      'log2_fold_change', 'padj']].head(12)

	combination	n_markers	n_K562	n_Serum-1	log2_fold_change	padj
0	CD151+CD147+EpCAM	3	175	0	25.954637	4.691640e-50
1	CD63_C+CD151+ADAM10	3	74	0	24.712880	3.751532e-21
2	CD151+CD147+ADAM10	3	63	0	24.480706	4.649494e-18
3	CD63_C+CD151+CD147+EpCAM	4	43	0	23.929691	2.500738e-12
4	CD166+EGFR	2	0	27	-23.400758	3.034834e-09
5	CD63_C+CD151+ADAM10+EpCAM	4	28	0	23.310781	2.324267e-08
6	CD151+CD147+ADAM10+EpCAM	4	25	0	23.147283	1.506511e-07
7	CD63_C+CD151+EpCAM	3	99	1	6.486909	1.635325e-26
8	CD151+CD147	2	264	6	5.316987	2.429619e-66
9	CD63_C+CD151+CD147	3	84	2	5.249871	4.689734e-21
10	CD151+ADAM10+EpCAM	3	71	2	5.007301	1.918283e-17
11	CD63_C+CD151	2	198	9	4.316987	2.049317e-44

# ranked bar chart of the top source-selective multi-protein combinations
sel = depc[(depc['n_markers'] >= 2) & (depc['padj'] < 0.05)].copy()
sel = sel.reindex(sel['log2_fold_change'].abs().sort_values(ascending=False).index)
top = sel.head(15).iloc[::-1]
colors = ['#c0392b' if v > 0 else '#2c6fbb' for v in top['log2_fold_change']]
plt.figure(figsize=(7, 5.5))
plt.barh(top['combination'], top['log2_fold_change'], color=colors)
plt.axvline(0, color='k', lw=0.8)
plt.xlabel('log2 fold-change  (K562  vs  Serum-1)')
plt.title('Source-selective EV protein combinations (DEPCs, PBA Fig. 5a)')
plt.tight_layout()
plt.show()

../_images/d1939c0c7922738e69fa1125de3c76af69d989f43479ced1e0e2fc432932e496.png

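The DEPC bar chart is the single-vesicle result at the heart of the PBA paper: specific multi-protein combinations are strongly enriched on the exosomes of one source over the other (red = enriched on K562 leukaemia exosomes, blue = enriched on serum exosomes). Several tetraspanin- and CD147-containing combinations come out as source-selective. The largest bars (|log2 fold-change| of roughly 23 to 26) belong to combinations with zero EVs in one source; their magnitude reflects how the zero is stabilised rather than a measured ratio, so read them as “present in one source, absent in the other” and rely on the counts (for example 175 vs 0 for CD151+CD147+EpCAM). Crucially, this is a statement about co-occurrence on the same vesicle: a bulk EV proteomic measurement of these two sources could only compare each protein’s average abundance one at a time and would never recover that a combination is source-specific.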
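The log2_fold_change column can be reproduced from the counts alone, which helps when reading the extreme values. The formula below, a ratio of per-source frequencies stabilised by a tiny pseudocount, is inferred from the table and is an assumption, as is the value 1e-9; it is not taken from the omicverse source. The per-source totals (2,691 K562 EVs, 2,438 Serum-1 EVs) are the n_evs values of the analysed subset.

```python
from math import log2

EPS = 1e-9  # assumed stabiliser, chosen because it reproduces the tabulated values


def depc_log2fc(n_a, n_b, total_a, total_b, eps=EPS):
    """log2 ratio of the fraction of EVs carrying a combination in source A vs B."""
    return log2((n_a / total_a + eps) / (n_b / total_b + eps))


N_K562, N_SERUM1 = 2691, 2438  # EVs per source in the analysed subset

lfc_pair = depc_log2fc(264, 6, N_K562, N_SERUM1)    # CD151+CD147
lfc_k562_only = depc_log2fc(175, 0, N_K562, N_SERUM1)  # CD151+CD147+EpCAM
lfc_serum_only = depc_log2fc(0, 27, N_K562, N_SERUM1)  # CD166+EGFR
print(round(lfc_pair, 3), round(lfc_k562_only, 3), round(lfc_serum_only, 3))
```

Under this reading, a combination seen 175 times in one source and never in the other scores about 26 only because the zero is replaced by 1e-9; a different pseudocount would change that number without changing the data.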

A note on PBA Fig. 5b/c#

The PBA paper’s Fig. 5b/c is a separate spike-in dilution experiment — K562 / prostasome exosomes were deliberately diluted into serum at 10 % down to 0.01 % to test rare-EV detection. The ev_pba tutorial dataset used here is the 15-source cell-line / serum panel and does not contain that purpose-built dilution series, so Fig. 5b/c is not reproduced here. We do not synthesise a dilution: a fabricated series would not be an honest reproduction. The rare-EV-detection question is best explored on the original spike-in data from the paper’s supplement.

12. Differential analysis across conditions#

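With 15 samples spanning cancer cell lines, normal cell lines and human serum, we can test what differs between conditions at single-EV resolution. differential_abundance tests each protein’s per-EV signal in cancer vs normal exosomes. Because adata.X holds CLR-normalized values at this point, group means can be negative: log2fc is then NaN when one mean is negative (ITGA9, ITGA3) and is not interpretable when both are (ITGAM, where log2fc is positive but effect_size is negative). Read the direction of change from effect_size and the two means rather than from log2fc alone.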

da = ov.single.ev.differential_abundance(
    adata, condition_key='condition', group_a='cancer', group_b='normal')
da.head(8)

	protein	log2fc	effect_size	mean_a	mean_b	n_a	n_b	pval	padj
0	CD13	-3.654837	-0.279692	0.012996	0.163685	23438	5620	2.225411e-145	8.901643e-144
1	Del1	-5.529277	-0.307167	0.004670	0.215661	23438	5620	7.450365e-138	1.490073e-136
2	ITGA9	NaN	-0.263860	-0.015500	0.114722	23438	5620	3.055329e-132	4.073772e-131
3	CD151	1.574638	0.266216	0.292459	0.098186	23438	5620	6.508558e-77	6.508558e-76
4	CD107a	-0.571882	-0.081540	0.107960	0.160479	23438	5620	2.332654e-50	1.866123e-49
5	ITGA3	NaN	0.274308	0.108291	-0.029493	23438	5620	1.440969e-48	9.606462e-48
6	CD26	-0.729062	-0.085754	0.076190	0.126289	23438	5620	4.504399e-39	2.573942e-38
7	ITGAM	1.423498	-0.173165	-0.077812	-0.029009	23438	5620	1.997133e-32	9.985667e-32

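differential_subpopulation asks whether the frequencies of the EV subpopulations shift between conditions. When a sample_key is supplied the test is replicate-aware: samples, not individual EVs, are the unit of replication (here 12 cancer vs 3 normal samples, compared with Welch’s t-test as the test column shows). On that footing none of the shifts is significant after BH correction (all padj > 0.5), even though several clusters differ by a factor of two or more in mean frequency.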

ds = ov.single.ev.differential_subpopulation(
    adata, condition_key='condition', cluster_key='flowsom',
    group_a='cancer', group_b='normal', sample_key='sample')
ds

	cluster	frac_a	frac_b	delta_frac	log2_ratio	stat	test	pval	padj
0	2	0.206491	0.067462	0.139029	1.613922	1.738558	welch_t	0.129163	0.516652
1	4	0.120624	0.048954	0.071670	1.301011	1.789924	welch_t	0.124104	0.516652
2	5	0.081949	0.033329	0.048620	1.297951	1.208636	welch_t	0.251299	0.619645
3	7	0.105697	0.180608	-0.074911	-0.772933	-1.048232	welch_t	0.387278	0.619645
4	1	0.149615	0.352587	-0.202972	-1.236727	-1.226228	welch_t	0.336561	0.619645
5	0	0.150515	0.085995	0.064520	0.807583	0.700297	welch_t	0.515748	0.687664
6	6	0.048258	0.108660	-0.060402	-1.170970	-0.569657	welch_t	0.624426	0.713630
7	3	0.136852	0.122405	0.014447	0.160954	0.221211	welch_t	0.832657	0.832657
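The padj column is a standard Benjamini-Hochberg adjustment of the pval column, which can be checked directly. The sketch below is a generic BH implementation (bh_adjust is a hypothetical helper, not the omicverse code) applied to the eight p-values in the table, in table order.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# pval column of the differential_subpopulation table, in table order.
pvals = [0.129163, 0.124104, 0.251299, 0.387278,
         0.336561, 0.515748, 0.624426, 0.832657]
padj = bh_adjust(pvals)
print([round(p, 6) for p in padj])
```

The two smallest p-values (about 0.12 and 0.13) both adjust to about 0.52, so with eight clusters and only three normal samples no frequency shift survives correction.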

# EV-subpopulation composition of every sample
ov.pl.cellproportion(adata, celltype_clusters='flowsom', groupby='sample',
                     figsize=(8, 4), legend=True)
plt.show()

../_images/84c53b1258576618d82f2c3d1e36a590ab0ebda9d4cb6ad9aeb8448375596ca3.png

The stacked-bar composition shows that different parental cell lines shed exosome populations with markedly different subpopulation mixtures — each cell line has a characteristic single-EV “fingerprint”.

13. Pseudo-bulk biomarker discovery#

Collapsing the per-EV profiles of the informative subset to a sample x protein matrix recovers a classic bulk measurement, on which moderated-t differential expression can be run for biomarker discovery — exploiting the 15 samples as replicates.

pb = ov.single.ev.pseudobulk(
    adata, sample_key='sample', condition_key='condition')
print('pseudo-bulk matrix :', pb.shape, '(samples x proteins)')
pb.obs[['n_evs', 'condition']]

pseudo-bulk matrix : (15, 40) (samples x proteins)

	n_evs	condition
sample
A549	2274	cancer
AGS	1860	cancer
BLC21	1709	cancer
Daudi	1964	cancer
HCT116	1457	cancer
HEK293	1952	normal
K562	2691	cancer
MM1	2619	cancer
MKN45	1051	cancer
MKN7	1703	cancer
PC3	2378	cancer
SK-N-SH	1010	cancer
Serum-1	2438	normal
Serum-2	1230	normal
U87MG	2722	cancer

pde = ov.single.ev.pseudobulk_de(
    pb, condition_key='condition', group_a='cancer', group_b='normal',
    method='moderated_t')
pde.head(8)

	protein	log2fc	mean_a	mean_b	t	pval	padj
0	ITGA1	-1.905111	0.000000	1.320522	-2.607334	0.018399	0.245333
1	ITGB3	-2.602239	0.265283	2.069018	-2.879828	0.010397	0.245333
2	ITGAM	-1.824594	0.000000	1.264712	-2.607302	0.018400	0.245333
3	CD63_D	4.542142	3.148373	0.000000	2.420784	0.026967	0.269671
4	ITGAL	-2.407844	0.497047	2.166038	-1.668069	0.113615	0.522662
5	CD318	3.877272	2.687520	0.000000	1.749008	0.098320	0.522662
6	ITGA3	3.714959	2.575014	0.000000	1.534304	0.143355	0.522662
7	CD151	4.245399	5.142744	2.200058	1.877367	0.077736	0.522662
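A quick check helps in reading this table: in every row shown, log2fc equals the difference of the two group means divided by ln 2. That relation is inferred from the tabulated numbers and suggests the pseudo-bulk values are on a natural-log scale; the helper below is an illustration under that assumption, not the pseudobulk_de implementation.

```python
from math import log


def log2fc_from_ln_means(mean_a, mean_b):
    """Convert a difference of natural-log-scale means into a log2 fold-change."""
    return (mean_a - mean_b) / log(2)


# (mean_a, mean_b, tabulated log2fc) copied from the pseudobulk_de output above.
rows = {
    "ITGA1": (0.000000, 1.320522, -1.905111),
    "ITGB3": (0.265283, 2.069018, -2.602239),
    "CD63_D": (3.148373, 0.000000, 4.542142),
    "CD151": (5.142744, 2.200058, 4.245399),
}
for protein, (mean_a, mean_b, tabulated) in rows.items():
    print(protein, round(log2fc_from_ln_means(mean_a, mean_b), 6), tabulated)
```

A group mean of exactly 0 (ITGA1 or ITGAM in cancer, CD63_D in normal) therefore does not produce an infinite fold-change here; the log transform has already absorbed the zero.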

14. MISEV2023 characterization report#

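Finally, misev_report assembles a MISEV2023-aligned characterization report. It is run on the raw counts: at this point adata.X holds the CLR-normalized matrix, so we restore the raw counts into a copy for an interpretable report. The report covers the informative subset analysed throughout the embedding/marker sections. Note that the purity_score of 1.0 follows from the panel containing no contaminant markers (n_contaminant_markers = 0); it does not by itself show that the preparation is free of co-isolated contaminants.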

adata_raw = adata.copy()
adata_raw.X = adata_raw.layers['counts']
report = ov.single.ev.misev_report(adata_raw)
print('--- MISEV2023 report ---')
for k, v in report['meta'].items():
    print(f'  {k:18s}: {v}')
for k, v in report['summary'].items():
    print(f'  {k:24s}: {v}')

--- MISEV2023 report ---
  n_evs             : 29058
  n_proteins        : 40
  value_type        : count
  platform          : unknown
  n_positive_markers      : 10
  n_contaminant_markers   : 0
  n_other_markers         : 30
  positive_signal         : 1.8102071718631703
  contaminant_signal      : 0.0
  purity_score            : 1.0
  mean_proteins_per_ev    : 4.205692064147567
  mean_total_signal_per_ev: 7.970851400646982

ov.single.ev.misev_marker_plot(adata_raw)
plt.show()

../_images/05f12f68bfdf0f01f8dcfee4025504b6cf653614dab5014ee45316a7135c01cb.png

Synthesis#

Working from real Proximity Barcoding Assay data we ran the complete ov.single.ev single-EV proteomics pipeline and reproduced the signature figures of Wu et al., Nat Commun 2019:

QC removed ~28,000 low-information EVs, leaving ~47,000 EVs x 40 surface proteins across 15 sources.
PBA Fig. 4a — the source x protein pseudo-bulk heatmap — was reproduced with ov.single.ev.pseudobulk + ov.pl.group_heatmap, showing which source carries which surface protein.
PBA Fig. 4b — three t-SNE panels stratified by detected-protein count — was reproduced faithfully: 1-protein exosomes give a radial “firework” with no source separation, 2-protein exosomes stay ambiguous, and only the >=3-protein subset (~29,000 EVs) resolves by source. The rest of the pipeline runs on that informative subset.
FlowSOM partitioned the subset into 8 EV subpopulations, each with a distinct surface-protein program (an EpCAM/CD166 epithelial subset, an integrin-module subset, others). The FlowSOM marker heatmap/dotplot — not the 2-D embedding — is the primary structural readout; Leiden over-fragmented, so FlowSOM is the robust choice here.
MISEV2023 marker classification recognised 10 transmembrane EV markers and 5 functional markers; tetraspanin subtyping showed a large tetraspanin-negative fraction alongside CD151-, CD63- and CD9-defined single-positive subsets and only a small triple-positive core — single-vesicle confirmation of the MISEV2023 caveat.
Marker colocalization showed the two CD63 barcodes co-occur on the same vesicles far above chance (an internal control) and that CD151/CD147 genuinely co-localize on a shared vesicle subset.
PBA Fig. 5a — the source-selective protein-combination (DEPC) analysis — was reproduced with ov.single.ev.protein_combinations: specific multi-protein combinations are enriched on K562 leukaemia exosomes versus serum exosomes. The paper’s Fig. 5b/c spike-in dilution is not reproduced — the tutorial dataset has no dilution series and we do not fabricate one.
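Differential analysis gave two views. Per-EV tests flagged strongly significant cancer/normal differences (for example CD13, Del1 and CD151), whereas the replicate-aware analyses (the subpopulation-frequency test and the 15-sample pseudo-bulk moderated-t test) found nothing significant after BH correction (smallest pseudo-bulk padj ≈ 0.25). The per-EV hits are candidates for biomarker follow-up, not confirmed biomarkers.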

The key idea: a single-EV dataset is an EV x protein matrix, so the single-cell toolkit applies almost unchanged — but single-vesicle resolution unlocks colocalization and source-selective protein combinations, the question of which markers ride the same vesicle, that no bulk assay can answer.

The companion notebook “Single-EV proteomics: imaging/intensity modality” runs the same ov.single.ev API on a different platform (MASEV cyclic immunofluorescence, value_type='intensity'), showing the module is platform-agnostic.