Genomics

A systematic, best-practice GWAS pipeline built on the omicverse.genetics module — a unified statistical-genetics framework that threads genotype, expression and complex traits into one analysis.

This chapter is a three-notebook end-to-end workflow, not a catalogue of methods.

The first two notebooks run entirely on real public data — exposed through ov.datasets loaders — so each result is what the data actually show, and each step feeds the next:

From genotypes to a fine-mapped locus: a real cis-eQTL association study on the GEUVADIS cohort (462 individuals from the 1000 Genomes Project, with real chr22 genotypes and lymphoblastoid-cell-line RNA-seq). The notebook covers sample QC, variant QC, population-structure correction by genotype PCA (the real European/African split), a cis-eQTL screen that picks a gene with a strong signal, a PC-adjusted association scan, genomic-inflation and Q-Q / Manhattan diagnostics, and SuSiE statistical fine-mapping to a 95% credible set.
From a GWAS hit to a mechanism: functional follow-up of the real blood lymphocyte-count GWAS of Astle et al. 2017 (GWAS Catalog GCST004627, N ≈ 173k). The notebook covers a Manhattan and genomic-inflation overview, SNP heritability by LD score regression, Bayesian colocalization of the GWAS against real GTEx v8 whole-blood cis-eQTLs, transcriptome-wide association (S-PrediXcan TWAS), Mendelian randomization with MR-Egger pleiotropy sensitivity, and single-cell disease-relevance scoring (scDRS) on a real PBMC atlas to find the disease-relevant cell type.
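
The 95% credible set that closes the first notebook can be illustrated for a single causal effect. The sketch below is not the pysusie API. It assumes a hypothetical vector of per-variant posterior probabilities for one effect that sums to about 1 (the variant IDs and values are invented for illustration) and returns the smallest set of variants whose cumulative probability reaches the requested coverage.

```python
def credible_set(posterior, coverage=0.95):
    """Smallest set of variants whose posterior probabilities sum to >= coverage.

    `posterior` maps variant ID -> posterior probability of being the causal
    variant for ONE effect; values are assumed to sum to (about) 1.
    """
    # Rank variants from most to least probable.
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = [], 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen, cumulative


# Hypothetical posterior for five variants at one locus (illustrative values only).
example_posterior = {
    "rs_a": 0.62,
    "rs_b": 0.25,
    "rs_c": 0.09,
    "rs_d": 0.03,
    "rs_e": 0.01,
}
cs_variants, cs_coverage = credible_set(example_posterior)
print(cs_variants, round(cs_coverage, 2))
```

A small credible set means the data localise the signal to a few variants. A large one usually reflects strong LD among the candidates, so the set should be read as "one of these" instead of a single causal call.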

Every step opens with the rationale — the why, the standard thresholds and how to interpret the result — and reports the real numbers honestly, caveats and all, so the chapter reads as a GWAS protocol on real data. The statistical engines are the standalone R-parity packages pymatrixeqtl, pysusie, pycoloc, pytwosamplemr, pyscdrs, pyldsc and pytwas; the GWAS core (QC, association, genomic inflation) needs no backend.
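
Genomic inflation is a good example of a step that needs no backend. The sketch below is an assumption-laden illustration using only the Python standard library, not the omicverse.genetics implementation: the function name `genomic_inflation` is hypothetical, p-values are assumed to be two-sided, from 1-degree-of-freedom tests, and strictly between 0 and 1.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the squared 75th percentile of the standard normal.
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(pvalues):
    """Lambda GC: median observed 1-df chi-square divided by its null median."""
    # A two-sided p-value p corresponds to |z| = -Phi^{-1}(p / 2), chi-square = z**2.
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / _CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a perfectly calibrated null scan (assumed data).
null_pvalues = [(i + 0.5) / 1001 for i in range(1001)]
lambda_gc = genomic_inflation(null_pvalues)
print(round(lambda_gc, 3))
```

A value close to 1 indicates well-calibrated test statistics. Values clearly above 1 point to residual confounding such as uncorrected population structure, or to genuine polygenic signal, which is why the diagnostic is read alongside the Q-Q plot.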

The third notebook, From a GWAS hit to spatially resolved mapping, covers spatially resolved GWAS with gsMap on the official MOSTA embryo dataset (E16.5_E1S1.MOSTA.h5ad, 121,767 spots). The pipeline learns a latent representation from expression and spatial structure, maps it back to gene-level specificity scores (GSS), connects pre-computed LD resources, runs spatial-LDSC with real GWAS summary statistics (IQ_NG_2018), aggregates spot-level p-values by Cauchy combination, and visualises the results with native OmicVerse plotting methods.
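
The Cauchy combination used to aggregate spot-level p-values can be sketched in a few lines. This is a generic version of the test, not gsMap's own code, which may treat weights and extreme p-values differently. The function name `cauchy_combination` and the example p-values are hypothetical, and every p-value is assumed to lie strictly between 0 and 1.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    # Each p-value is mapped to a standard Cauchy variate; under the null the
    # normalised weighted sum is again (approximately, in the tail) standard Cauchy.
    statistic = sum(
        (w / total) * math.tan((0.5 - p) * math.pi)
        for w, p in zip(weights, pvalues)
    )
    # Upper-tail probability of the standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p-values for four spots in one annotated region.
spot_pvalues = [0.01, 0.20, 0.50, 0.80]
combined = cauchy_combination(spot_pvalues)
print(combined)
```

The combined p-value is driven mainly by the smallest inputs, and the approximation holds in the tail without assuming independence between tests, which suits neighbouring spots whose statistics are correlated.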