Marker peaks and differential accessibility

Marker peaks and differential accessibility#

Once cells are grouped (by cluster or annotated cell type), the natural question is which regulatory regions define each group. ov.epi.tl.find_marker_features runs an ArchR-style test — a Wilcoxon test of each peak against a bias-matched background of the other cells — to rank the marker peaks of every group.

We use the annotated 10x PBMC 10k multiome ATAC peak matrix (cell_type already provided) so the marker peaks can be read against known cell types:

rank marker peaks per cell type (ov.epi.tl.find_marker_features)
inspect the top markers and annotate them to their nearest gene
visualise one cell type’s differential accessibility as a volcano plot
summarise top marker accessibility across cell types as a heatmap

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import omicverse as ov
ov.epi.pl.plot_set()
print('omicverse', ov.__version__)

└─ 🔬 Starting plot initialization...
  ├─ Apply Scanpy/matplotlib settings
  ├─ Custom font setup
  ├─ Suppress warnings
  ├─ 
___________      .__                      
\_   _____/_____ |__| ____   ____   ____  
 |    __)_\____ \|  |/  _ \ /    \_/ __ \ 
 |        \  |_> >  (  <_> )   |  \  ___/ 
/_______  /   __/|__|\____/|___|  /\___  >
        \/|__|                  \/     \/ 

  ├─ 🔖 Version: 0.0.1rc1   📚 Tutorials: https://epione.readthedocs.io/
└─ ✅ plot_set complete.

omicverse 2.2.1rc1

1 · Load the annotated ATAC peak matrix#

atac, _ = ov.epi.datasets.pbmc10k_multiome()
print('ATAC peak matrix:', atac.shape)
atac.obs['cell_type'].value_counts()

ATAC peak matrix: (9631, 107194)

cell_type
CD14 Mono         2554
CD4 Naive         1382
CD8 Naive         1354
CD4 TCM           1113
CD16 Mono          442
NK                 403
CD8 TEM_1          322
CD8 TEM_2          315
Intermediate B     300
Memory B           298
CD4 TEM            286
cDC                180
Treg               157
gdT                143
MAIT               130
Naive B            125
pDC                 98
HSPC                17
Plasma              12
Name: count, dtype: int64

2 · Marker peaks per cell type#

find_marker_features returns a tidy table — one row per (group, peak) — with effect size (log2_fc), detection fractions, and significance (fdr).

markers = ov.epi.tl.find_marker_features(atac, group_by='cell_type')
print('marker table:', markers.shape)
markers.head()

└─ [find_marker_features] group='CD4 Naive' | fg=1,382 cells | bg=1,382 cells (matched=no)
└─ [find_marker_features] group='CD4 TCM' | fg=1,113 cells | bg=1,113 cells (matched=no)
└─ [find_marker_features] group='CD4 TEM' | fg=286 cells | bg=286 cells (matched=no)
└─ [find_marker_features] group='CD8 Naive' | fg=1,354 cells | bg=1,354 cells (matched=no)
└─ [find_marker_features] group='CD8 TEM_1' | fg=322 cells | bg=322 cells (matched=no)
└─ [find_marker_features] group='CD8 TEM_2' | fg=315 cells | bg=315 cells (matched=no)
└─ [find_marker_features] group='CD14 Mono' | fg=2,000 cells | bg=2,000 cells (matched=no)
└─ [find_marker_features] group='CD16 Mono' | fg=442 cells | bg=442 cells (matched=no)
└─ [find_marker_features] group='HSPC' | fg=17 cells | bg=17 cells (matched=no)
└─ [find_marker_features] group='Intermediate B' | fg=300 cells | bg=300 cells (matched=no)
└─ [find_marker_features] group='MAIT' | fg=130 cells | bg=130 cells (matched=no)
└─ [find_marker_features] group='Memory B' | fg=298 cells | bg=298 cells (matched=no)
└─ [find_marker_features] group='NK' | fg=403 cells | bg=403 cells (matched=no)
└─ [find_marker_features] group='Naive B' | fg=125 cells | bg=125 cells (matched=no)
└─ [find_marker_features] group='Plasma' | fg=12 cells | bg=12 cells (matched=no)
└─ [find_marker_features] group='Treg' | fg=157 cells | bg=157 cells (matched=no)
└─ [find_marker_features] group='cDC' | fg=180 cells | bg=180 cells (matched=no)
└─ [find_marker_features] group='gdT' | fg=143 cells | bg=143 cells (matched=no)
└─ [find_marker_features] group='pDC' | fg=98 cells | bg=98 cells (matched=no)
marker table: (2036686, 13)

	group	feature	mean_fg	mean_bg	pct_fg	pct_bg	log2_fc	U	z	p_value	fdr	n_fg	n_bg
0	CD4 Naive	chr1:816881-817647	0.017366	0.104197	0.009407	0.047033	-2.584893	918951.5	-5.984385	2.172087e-09	1.390224e-08	1382	1382
1	CD4 Naive	chr1:819912-823500	0.103473	0.151954	0.047757	0.068017	-0.554370	935616.5	-2.279238	2.265292e-02	3.937566e-02	1382	1382
2	CD4 Naive	chr1:825827-825889	0.002171	0.000000	0.002171	0.000000	11.084654	957035.0	1.732678	8.315291e-02	1.252705e-01	1382	1382
3	CD4 Naive	chr1:826612-827979	0.397250	0.463097	0.184515	0.205499	-0.221265	933151.0	-1.511729	1.306028e-01	1.867460e-01	1382	1382
4	CD4 Naive	chr1:841243-843059	0.117945	0.035456	0.064399	0.019537	1.733990	997784.5	5.876457	4.191396e-09	2.571794e-08	1382	1382

Top markers for a few cell types#

Filtering to significant, up-accessible peaks (fdr < 0.05, log2_fc > 0) and taking the top by effect size gives each cell type’s most specific regulatory regions.

sig = markers[(markers['fdr'] < 0.05) & (markers['log2_fc'] > 0)]
top = (sig.sort_values('log2_fc', ascending=False)
          .groupby('group').head(5))
for ct in ['CD14 Mono', 'Naive B', 'NK']:
    if ct in set(top['group']):
        print(f'\n=== {ct} top marker peaks ===')
        print(top[top['group'] == ct][['feature', 'log2_fc', 'pct_fg', 'pct_bg', 'fdr']]
              .to_string(index=False))

=== CD14 Mono top marker peaks ===
                  feature   log2_fc  pct_fg  pct_bg          fdr
      chr18:400354-400952 17.587843  0.0950     0.0 4.203895e-44
 chr2:126831558-126832185 17.074320  0.0635     0.0 1.981806e-29
 chr9:124143849-124144330 16.884659  0.0650     0.0 4.058194e-30
chr10:116465495-116465971 16.686897  0.0600     0.0 7.918512e-28
  chr18:25299007-25299800 16.587851  0.0485     0.0 1.317443e-22

=== Naive B top marker peaks ===
                 feature   log2_fc  pct_fg  pct_bg          fdr
  GL000205.2:67888-69629 20.919981   0.552     0.0 3.561571e-18
 chr11:60455290-60456088 20.652287   0.544     0.0 6.521199e-18
  chr5:78727641-78728723 20.595142   0.504     0.0 3.093037e-16
 chr18:76980792-76981819 19.896523   0.368     0.0 3.864268e-11
chr7:139714137-139715063 19.747145   0.320     0.0 1.753880e-09

=== NK top marker peaks ===
                 feature   log2_fc   pct_fg  pct_bg          fdr
chr1:162411943-162412452 19.116138 0.240695     0.0 4.920916e-23
 chr10:34348914-34349332 18.956415 0.215881     0.0 1.906227e-20
 chr15:69499406-69499933 18.823830 0.220844     0.0 5.866042e-21
 chr15:98952272-98953616 18.816095 0.178660     0.0 1.082036e-16
chr1:162413846-162414255 18.616787 0.208437     0.0 1.096585e-19

3 · Annotate top marker peaks to their nearest gene#

A marker peak is most interpretable next to a gene name. We assign each top peak to the nearest gene TSS using the Gencode annotation.

import pyranges as pr
g = pr.read_gff3(str(ov.epi.data.hg38.annotation)).df
gene_ann = g[g.Feature == 'gene'][['gene_name', 'Chromosome', 'Start', 'End', 'Strand']].copy()
gene_ann.columns = ['gene_name', 'chrom', 'start', 'end', 'strand']
gene_ann['tss'] = np.where(gene_ann['strand'] == '+', gene_ann['start'], gene_ann['end'])

def nearest_gene(peak):
    chrom, rng = peak.split(':'); s, e = rng.split('-')
    center = (int(s) + int(e)) // 2
    sub = gene_ann[gene_ann['chrom'] == chrom]
    if sub.empty:
        return ('NA', np.nan)
    i = (sub['tss'] - center).abs().values.argmin()
    row = sub.iloc[i]
    return (row['gene_name'], int(abs(row['tss'] - center)))

top = top.copy()
top[['nearest_gene', 'dist_to_tss']] = [nearest_gene(p) for p in top['feature']]
top[top['group'].isin(['CD14 Mono', 'Naive B', 'NK'])][
    ['group', 'feature', 'log2_fc', 'nearest_gene', 'dist_to_tss']].head(15)

	group	feature	log2_fc	nearest_gene	dist_to_tss
1500690	Naive B	GL000205.2:67888-69629	20.919981	NA	NaN
1410865	Naive B	chr11:60455290-60456088	20.652287	MS4A1	156.0
1473922	Naive B	chr5:78727641-78728723	20.595142	ENSG00000254170	12100.0
1440859	Naive B	chr18:76980792-76981819	19.896523	ENSG00000266844	6616.0
1488361	Naive B	chr7:139714137-139715063	19.747145	TBXAS1	62450.0
1293369	NK	chr1:162411943-162412452	19.116138	SH2D1B	59.0
1298310	NK	chr10:34348914-34349332	18.956415	ELOBP4	139755.0
1320577	NK	chr15:69499406-69499933	18.823830	DRAIC	36749.0
1321922	NK	chr15:98952272-98953616	18.816095	PGPEP1L	54848.0
1293370	NK	chr1:162413846-162414255	18.616787	SH2D1B	1912.0
688666	CD14 Mono	chr18:400354-400952	17.587843	ENSG00000265477	23763.0
700059	CD14 Mono	chr2:126831558-126832185	17.074320	TEX51	66992.0
746299	CD14 Mono	chr9:124143849-124144330	16.884659	ENSG00000234921	111565.0
657967	CD14 Mono	chr10:116465495-116465971	16.686897	HMGB3P8	25689.0
689404	CD14 Mono	chr18:25299007-25299800	16.587851	ENSG00000264434	2722.0

4 · Volcano plot for one cell type#

Plotting effect size against significance for all peaks tested in a single cell type (here CD14 monocytes) gives the familiar volcano view of differential accessibility.

# Differential accessibility for one cell type as an omicverse volcano plot.
ct = 'CD14 Mono'
res = markers[markers['group'] == ct].copy()
res = res.set_index('feature')
res['sig'] = np.where((res['fdr'] < 0.05) & (res['log2_fc'] > 1), 'up',
              np.where((res['fdr'] < 0.05) & (res['log2_fc'] < -1), 'down', 'normal'))
ov.pl.volcano(res, pval_name='fdr', fc_name='log2_fc', pval_threshold=0.05,
              fc_max=1.0, fc_min=-1.0, plot_genes_num=8,
              title=f'{ct} - differential accessibility')
plt.show()

🌋 Volcano Plot Analysis:
   Total genes: 107194
   ↗️  Upregulated genes: 39703
   ↘️  Downregulated genes: 23391
   ➡️  Non-significant genes: 44100
   🎯 Total significant genes: 63094
   log2_fc range: -17.75 to 17.59
   fdr range: 0.00e+00 to 1.00e+00

⚙️  Current Function Parameters:
   Data columns: pval_name='fdr', fc_name='log2_fc'
   Thresholds: pval_threshold=0.05, fc_max=1.0, fc_min=-1.0
   Plot size: figsize=(4, 4)
   Gene labels: plot_genes_num=8, plot_genes_fontsize=10
   Custom genes: None (auto-select top genes)

💡 Parameter Optimization Suggestions:
   ▶ Wide fold change range detected:
     Current: fc_max=1.0, fc_min=-1.0
     Suggested: fc_max=3.8, fc_min=-4.0

   📋 Copy-paste ready function call:
   ov.pl.volcano(result, pval_name='fdr', fc_name='log2_fc', fc_max=3.8, fc_min=-4.0)
────────────────────────────────────────────────────────────

../_images/65ac99fd48239a71df02a078cbfca6780cf982af3cdbd9ebe35b2e36588ff96b.png

5 · Marker accessibility dotplot#

Taking the top marker peaks across cell types and rendering them as an omicverse dotplot produces a block-diagonal pattern — each set of marker peaks is specific to its cell type.

# Top marker peaks per cell type as an omicverse dotplot (peaks as features).
topn = (sig.sort_values('log2_fc', ascending=False).groupby('group').head(5))
peaks = topn['feature'].unique().tolist()
ov.pl.dotplot(atac, var_names=peaks, groupby='cell_type', use_raw=False,
              standard_scale='var', show=False)
plt.show()

../_images/fc539d323eb5bdecc49b3521ce4f19531eeea16de8179737387a98c088522671.png

Summary#

stage	function
marker peaks per group	`ov.epi.tl.find_marker_features`
peak → nearest gene	Gencode annotation (`ov.epi.data.hg38.annotation`)
differential accessibility	volcano of `log2_fc` vs `fdr`

find_marker_features works equally on a peak matrix, a gene-score matrix or gene expression — so the same call ranks marker peaks, marker gene-activities, or marker genes. For two-group designs with replicates, ov.epi.tl.differential_peaks offers a pseudobulk pyDESeq2 test.