AlphaGenome Explained: Variant Effect Prediction at Scale (2026)

AlphaGenome variant effect prediction is the regulatory-genomics analogue of what AlphaFold did for structural biology. Google DeepMind released AlphaGenome in June 2025 as a unified sequence-to-function model that ingests up to 1 megabase (1,048,576 bp) of raw DNA context, predicts roughly 5,000 functional tracks at base-pair resolution across hundreds of human and mouse tissues, and emits calibrated variant-effect deltas for any substitution, indel, or short structural change inside that window. Before AlphaGenome, the field cycled through Basenji, ExPecto, DeepSEA, and Enformer — useful but each capped at narrower context, fewer modalities, or coarser resolution. The interesting question is no longer “can a transformer predict regulatory activity?” — it is whether AlphaGenome’s predictions are accurate enough that the bottleneck for non-coding variant interpretation moves from in-silico modeling to wet-lab validation throughput.

Architecture at a glance

[Figure: AlphaGenome architecture diagram]

This post walks through the biology problem, the architecture, the training corpus, the published benchmarks, and what the model actually changes for clinical variant labs, CRISPR screen prioritization, and generative-genomics pipelines. The thesis: AlphaGenome is the AlphaFold moment for regulatory genomics, and the rate-limiting step has now shifted to functional validation.

Why non-coding variant effect prediction is the bottleneck

Roughly 98% of disease-associated genetic variants discovered through genome-wide association studies (GWAS) and whole-genome sequencing lie outside protein-coding regions, in promoters, enhancers, splice sites, untranslated regions, and intergenic regulatory elements. The clinical interpretation pipeline — best captured by the ACMG/AMP variant classification framework — was built around coding variants where the genetic code, conservation, and protein-structure tools give strong priors. For non-coding variants, the same framework collapses to “PM2 / BP4 / uncertain” categories most of the time. The ClinVar database as of early 2026 contains over 3.5M submitted variants, and the variant-of-uncertain-significance (VUS) rate for non-coding regions sits above 80%.

The reason this matters at population scale: the UK Biobank, All of Us, and Genomics England 100,000 Genomes cohorts together have sequenced more than 1.5 million genomes. Each genome carries roughly 4–5 million variants vs. GRCh38, and a typical individual carries 50–100 rare non-coding variants in regulatory elements of known disease genes. Without a model that scores those variants for likely regulatory disruption, every one is a candidate for downstream interpretation labor that does not scale.

The prior generation of models established the recipe. DeepSEA (Zhou and Troyanskaya, 2015) was the first deep model that took 1 kb DNA context and predicted 919 chromatin features. Basenji and Basenji2 (David Kelley et al., 2018–2020) pushed context to 131 kb and added cap analysis of gene expression (CAGE) heads. Enformer (Avsec et al., 2021) jumped to 196,608 bp context using attention layers and became the de-facto baseline for in-silico mutagenesis on non-coding variants. ExPecto and its successors layered tissue-specific expression on top. Each of these worked, but each hit a ceiling: 200 kb is not enough to capture distal enhancer-gene interactions across topologically associating domains, base resolution was lost to 128 bp bins, and splicing was handled by a separate model (SpliceAI) trained independently.

AlphaGenome’s claim is that a single model, trained jointly across modalities at base resolution with 1 Mb context, beats the specialist models on their own benchmarks while also producing internally consistent multi-modality predictions for any variant.

The AlphaGenome reference architecture

AlphaGenome is a U-Net-style sequence encoder with a transformer trunk and a fan-out of task-specific decoder heads, all trained end-to-end on roughly 5,000 functional tracks spanning chromatin accessibility, histone marks, transcription factor binding, RNA-seq expression, CAGE, polyA signals, and splicing. The encoder consumes one-hot encoded DNA at 1 bp resolution and progressively downsamples through 1D convolutions; the trunk reasons over long-range dependencies; the heads upsample back to the resolution each modality actually needs, ranging from base-pair (splice junctions) to 128 bp (chromatin) to gene-level (expression).

A few specifics worth pinning down from the AlphaGenome preprint and DeepMind technical report. The input window is 1,048,576 bp (1 Mb), which is large enough to contain the median topologically associating domain (TAD), the unit of regulatory insulation in mammalian genomes, and the great majority of known enhancer-promoter pairs. The encoder uses convolutional stages with progressive 2× downsampling, producing an internal sequence of 128 bp tokens before the transformer layers; reaching 128 bp from 1 bp takes seven halvings (2^7 = 128). The transformer uses rotary positional embeddings and operates on 8,192 tokens for the full window (1,048,576 / 128), with bidirectional attention because regulatory grammar reads in both directions.
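The window and token arithmetic in this paragraph can be checked in a few lines. This is a back-of-the-envelope sketch, not AlphaGenome code: the function name `tokens_after` is hypothetical, and it assumes every stage halves the sequence length exactly. Under that assumption, 128 bp tokens correspond to seven 2× stages, while six would stop at 64 bp.

```python
WINDOW_BP = 1_048_576  # 2**20, the 1 Mb input window


def tokens_after(stages: int, window_bp: int = WINDOW_BP) -> tuple[int, int]:
    """Return (bp per token, token count) after `stages` exact 2x downsamplings."""
    bp_per_token = 2 ** stages
    return bp_per_token, window_bp // bp_per_token


for stages in (6, 7):
    bp, n_tokens = tokens_after(stages)
    print(f"{stages} stages -> {bp} bp/token, {n_tokens} tokens")
# 6 stages -> 64 bp/token, 16384 tokens
# 7 stages -> 128 bp/token, 8192 tokens
```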

The decoder side is where the multi-modality story plays. Each modality has its own head architecture matched to the biology:

Chromatin accessibility (DNase-seq, ATAC-seq) — 128 bp resolution, ~700 cell-type tracks. Output is signal coverage.
Histone modifications (ChIP-seq for H3K27ac, H3K4me3, H3K27me3, etc.) — 128 bp resolution, ~2,000 tracks across ENCODE and Roadmap Epigenomics samples.
Transcription factor binding — 128 bp resolution, hundreds of TF/cell-type combinations.
CAGE (cap analysis of gene expression) — bp-resolution promoter activity at transcription start sites, drawn from FANTOM5.
RNA-seq coverage — 128 bp resolution across GTEx-derived tissue tracks.
Splice site usage and junction strength — base-pair resolution, replacing the need for a separate SpliceAI call.
Contact maps (Micro-C, Hi-C) — 2D output for predicted 3D folding within the window.

The eleven modality families and the way they share the encoder are the defining design choice. Models like SpliceAI predicted splicing in isolation; AlphaGenome learns splicing in the context of chromatin and TF binding, which is biologically correct because splicing is co-transcriptional and regulated by chromatin state. The cross-modal regularization tightens predictions for any single modality on its own benchmark — this is the same multi-task lesson that drove the AlphaFold 2 → AlphaFold 3 progression, covered in the AlphaFold 3 protein-ligand cofolding architecture deep dive.

The U-Net design choice is also worth understanding in its own right. A pure transformer over 1 Mb of base-resolution input would require attention over a million tokens, which is computationally infeasible with standard attention. A pure convolution stack would lose long-range coupling. The U-Net topology resolves this by letting convolutions handle local feature extraction at high resolution, dropping into a transformer at a tokenized middle resolution where long-range attention is tractable, then projecting back out to whatever resolution each downstream head needs. The skip connections from encoder to head let local sequence features bypass the transformer entirely when biology demands base-level precision — most importantly at splice sites, where individual base changes can flip a junction’s classification. This is the same architectural pattern that has dominated medical image segmentation for nearly a decade, applied to a 1D biological signal.
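To make the "computationally infeasible" claim concrete, the sketch below estimates the memory for a single dense attention score matrix at each resolution. It assumes the scores are fully materialized at 2 bytes per entry (fp16 or bf16) for one head in one layer; real implementations use memory-efficient attention kernels, so treat the numbers as an order-of-magnitude illustration, not a measurement of AlphaGenome.

```python
def attention_matrix_gib(n_tokens: int, bytes_per_entry: int = 2) -> float:
    """Memory for one dense n x n attention score matrix (one head, one layer), in GiB."""
    return n_tokens * n_tokens * bytes_per_entry / 2**30


base_resolution = attention_matrix_gib(1_048_576)  # one token per base pair
tokenized = attention_matrix_gib(8_192)            # one token per 128 bp

print(f"1 bp tokens:   {base_resolution:,.0f} GiB per head per layer")
print(f"128 bp tokens: {tokenized:.3f} GiB per head per layer")
print(f"ratio:         {base_resolution / tokenized:,.0f}x")
# 1 bp tokens:   2,048 GiB per head per layer
# 128 bp tokens: 0.125 GiB per head per layer
# ratio:         16,384x
```

The 128× reduction in sequence length buys a 128² = 16,384× reduction in attention cost, which is why the transformer sits at the bottom of the U rather than at the input.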

Training compute is substantial but not extreme by frontier-model standards. The full AlphaGenome model is on the order of hundreds of millions of parameters — small compared to a language model, large for a domain-specific scientific model — and training runs on the order of weeks on a TPU pod. That puts the model in the regime where it is reproducible by well-funded academic groups and large pharma, even if the training recipe is non-trivial to reproduce exactly. The released inference weights and code mean nobody needs to retrain to use the model; the open question is whether the field will see independent retrainings on alternative training corpora to test the model’s data-attribution properties.

Variant scoring as a difference of forward passes

The variant effect score is the explicit point of the model. AlphaGenome computes a variant effect by running two forward passes — one with the reference allele centered in the window, one with the alternate allele — and taking the per-modality, per-track difference of predicted signal. The deltas are then summarized into directional scores (gain or loss of expression, enhancer activation, splice disruption, etc.) and aggregated into a small set of variant-effect descriptors that downstream tools consume.

A worked example: a single nucleotide variant in a hypothetical liver-specific enhancer 50 kb upstream of a target gene. AlphaGenome centers a 1 Mb window on the variant, runs the reference forward pass, swaps the central base to the alternate allele, runs the alternate forward pass, and subtracts. The output is a per-track delta tensor: maybe -0.42 standard deviations of HepG2 H3K27ac signal at the variant locus, -0.18 standard deviations of accessibility, and a -0.31 delta on the predicted CAGE signal at the downstream target gene’s promoter. Those three quantities together — local chromatin loss plus distal expression loss with the right tissue specificity — are exactly the signature a clinical curator would want to see before flagging a rare regulatory variant as likely pathogenic.
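The REF/ALT subtraction itself is simple enough to sketch with a toy stand-in for the model. Everything here is hypothetical: `toy_forward` is not AlphaGenome, and its two tracks (GC content and a TATA motif match) exist only so the delta logic has something to run on. The part that carries over is the shape of the computation: two independent forward passes, then a per-track ALT-minus-REF difference.

```python
from typing import Callable, Dict, List

Track = List[float]


def toy_forward(seq: str) -> Dict[str, Track]:
    """Stand-in for a sequence-to-function model (NOT AlphaGenome).

    Emits one value per base for two made-up tracks.
    """
    gc = [1.0 if base in "GC" else 0.0 for base in seq]
    motif = "TATA"
    hits = [1.0 if seq[i:i + len(motif)] == motif else 0.0 for i in range(len(seq))]
    return {"toy_accessibility": gc, "toy_promoter": hits}


def score_snv(seq: str, offset: int, ref: str, alt: str,
              forward: Callable[[str], Dict[str, Track]] = toy_forward) -> Dict[str, Track]:
    """ALT-minus-REF delta per track, from two independent forward passes."""
    if seq[offset] != ref:
        raise ValueError(f"reference mismatch at {offset}: expected {ref}, found {seq[offset]}")
    alt_seq = seq[:offset] + alt + seq[offset + 1:]
    ref_pred, alt_pred = forward(seq), forward(alt_seq)
    return {name: [a - r for a, r in zip(alt_pred[name], ref_pred[name])]
            for name in ref_pred}


window = "GGCTATAAGC"  # toy window; the A at offset 4 sits inside a TATA motif
deltas = score_snv(window, offset=4, ref="A", alt="G")
print({name: sum(track) for name, track in deltas.items()})
# {'toy_accessibility': 1.0, 'toy_promoter': -1.0}
```

The reference-allele check mirrors what any real scorer has to do first: a VCF row whose REF does not match the genome build is a coordinate error, not a variant.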

The pipeline contract for downstream tools is straightforward. Given a VCF row (chromosome, position, ref, alt, optional tissue context), AlphaGenome returns a vector of per-modality effect scores plus the raw track-level deltas if needed. Most clinical applications consume the summary scores; CRISPR screen prioritization and generative-genomics applications consume the full delta tensor.

# Pseudocode: AlphaGenome variant scoring contract.
# Model, Variant, score_variant, and all field names below are idealized, not the real API.
from alphagenome import Model, Variant

model = Model.load("alphagenome-1mb-v1")

variant = Variant(
    chrom="chr19",
    pos=11_200_138,           # GRCh38 coordinate
    ref="C",
    alt="T",
    tissue_context=["liver", "HepG2"],
)

scores = model.score_variant(
    variant,
    window_bp=1_048_576,
    modalities=["rna", "cage", "h3k27ac", "dnase", "splice"],
)

# scores.summary: dict of high-level descriptors
# scores.tracks:  per-track delta arrays (numpy)
print(scores.summary["expression_delta_zscore"])  # -3.1
print(scores.summary["splice_disruption_score"])  # 0.04 (low)

The contract above is pseudocode that mirrors DeepMind’s published API surface but with idealized field names; consult the official AlphaGenome documentation on GitHub for the canonical interface.

One implementation detail worth flagging: AlphaGenome’s variant deltas are computed by re-running the entire 1 Mb window for the alternate allele, not by linearly perturbing the reference forward pass. This matters because the transformer trunk is non-linear in long-range context; a single base change at position p can propagate to predicted signal at position p ± 500 kb in non-trivial ways. The pair-of-forward-passes design is what gives the model its distal sensitivity, but it also means variant scoring throughput is exactly half of inference throughput per GPU, with no shortcut from caching encoder activations across REF and ALT.

The other practical operational point is normalization. Raw track-level deltas are not interpretable without context — a delta of 0.1 means very different things for a high-signal track (active promoter, range ~10) versus a low-signal track (silent intergenic region, range ~0.2). AlphaGenome’s published variant scorer normalizes deltas to a per-track z-score against an empirical null distribution drawn from random variants in matched regulatory contexts. Always work with the normalized deltas in downstream code; the raw signal-coverage deltas will mislead you on cross-track comparisons.
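A minimal sketch of the normalization idea, assuming the null distribution for each track is available as a plain list of raw deltas from background variants. The function name `delta_zscore` and the two example null sets are illustrative assumptions; the published scorer's exact procedure may differ, and a mean/standard-deviation z-score is only the simplest way to standardize against an empirical null.

```python
import statistics
from typing import Sequence


def delta_zscore(raw_delta: float, null_deltas: Sequence[float]) -> float:
    """Standardize a raw track delta against that track's empirical null deltas."""
    mu = statistics.fmean(null_deltas)
    sigma = statistics.stdev(null_deltas)  # sample standard deviation
    if sigma == 0:
        raise ValueError("null distribution has zero variance")
    return (raw_delta - mu) / sigma


# Made-up null deltas: a high-signal track varies a lot, a low-signal track barely moves.
high_signal_null = [-1.0, -0.5, 0.0, 0.5, 1.0]
low_signal_null = [-0.02, -0.01, 0.0, 0.01, 0.02]

z_high = delta_zscore(0.1, high_signal_null)
z_low = delta_zscore(0.1, low_signal_null)
print(f"high-signal track: z = {z_high:.2f}")
print(f"low-signal track:  z = {z_low:.2f}")
```

The same raw delta of 0.1 is unremarkable on the high-signal track and an outlier on the low-signal one, which is the cross-track comparison problem the paragraph above describes.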

Training data, supervision, and the modality stack

AlphaGenome is trained on the union of ENCODE, GTEx, FANTOM5, Roadmap Epigenomics, and 4D Nucleome contact data, with held-out chromosomes used for validation and a separate test chromosome holdout reserved for benchmark reporting. The supervision target for each track is the experimentally measured signal coverage at the relevant resolution, transformed to a variance-stabilized scale.

The numbers that matter for understanding the model’s coverage: ENCODE Phase 4 provides over 25,000 individual experiments spanning DNase, ATAC, ChIP, RNA-seq, and Hi-C across human and mouse. GTEx v10 contributes RNA-seq from 54 tissues and roughly 17,500 samples, which is the source of most tissue-resolved expression supervision. FANTOM5 contributes CAGE across hundreds of primary cell types — the gold-standard data for promoter activity. Roadmap Epigenomics filled in primary tissue chromatin where ENCODE focused on cell lines. The composite training set covers thousands of tracks per modality and is the reason AlphaGenome can answer “what happens in liver” vs. “what happens in cortex” rather than just averaged predictions.

A few non-obvious design details from the technical report shape behavior in production:

Reverse-complement augmentation: every training example is shown to the model in both orientations, and the model is constrained to produce equivariant predictions. This matters because regulatory elements are bidirectional and the model would otherwise overfit strand-specific quirks.
Random shift augmentation: the 1 Mb window is jittered by up to ±32 bp at training time. This forces position-invariance for local features and prevents the model from memorizing absolute positions.
Cross-modal consistency loss: predicted tracks that should correlate biologically (e.g., DNase and H3K27ac at active enhancers) are regularized to be consistent. This is the cross-modality regularization that makes single-modality predictions tighter.
Held-out evaluation chromosomes: chromosomes 8 and 9 are kept entirely out of the training set so benchmark numbers are reported on sequence the model has not seen in any form.
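The first two augmentations in this list are easy to sketch on plain strings. The helper names below are hypothetical and this is not the AlphaGenome training code; it only shows what "both orientations" and "jittered by up to ±32 bp" mean operationally, with the jittered start clamped so the window stays inside the sequence.

```python
import random

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def random_shift_window(chrom_seq: str, start: int, window: int,
                        max_shift: int, rng: random.Random) -> tuple[int, str]:
    """Jitter a training window by up to +/- max_shift bp, clamped to the sequence."""
    shift = rng.randint(-max_shift, max_shift)
    new_start = min(max(start + shift, 0), len(chrom_seq) - window)
    return new_start, chrom_seq[new_start:new_start + window]


rng = random.Random(0)                  # seeded for reproducibility
chrom = "ACGTTGCAAGGCTTAACCGG" * 10     # 200 bp toy "chromosome"
start, win = random_shift_window(chrom, start=50, window=64, max_shift=32, rng=rng)

print(reverse_complement("ACGTN"))      # NACGT
print(len(win), reverse_complement(reverse_complement(win)) == win)  # 64 True
```

In a real pipeline the target tracks have to be shifted and, for reverse-complemented inputs, reversed (and strand-swapped where tracks are stranded) in lockstep with the sequence.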

The corpus and the augmentations matter for what the model can and cannot predict. AlphaGenome will be best on regulatory elements similar in flavor to ENCODE/GTEx/FANTOM5 cell types, mediocre on tissues underrepresented in the training data (small intestine, placenta, primary neurons), and weak on individual-genetic-background effects because the supervision data is overwhelmingly from reference genomes.

The held-out chromosome strategy deserves a note because it determines what “test set performance” actually means. Because chromosomes 8 and 9 are held out entirely, the model never sees during training any of the regulatory sequence that happens to live on them. This is a stricter generalization test than the more common “held-out positions on every chromosome” split: it forces the model to learn transferable sequence patterns rather than chromosome-specific quirks. The published benchmark numbers reflect this stricter split, so they are more honest measures of generalization than what some prior models reported. When comparing AlphaGenome to older models, check whether the older numbers used the same holdout strategy; if they did not, the comparison is not apples-to-apples.
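A chromosome-level split is a one-line filter, which is part of its appeal. The sketch below uses a hypothetical `Interval` type and takes the held-out set (chr8 and chr9) from the text above; it is an illustration of the split, not the project's data loader.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


HELD_OUT = frozenset({"chr8", "chr9"})  # evaluation chromosomes named in the text


def split_by_chromosome(
    intervals: Iterable[Interval], held_out: frozenset = HELD_OUT
) -> Tuple[List[Interval], List[Interval]]:
    """Route whole chromosomes to train or test; no chromosome appears in both."""
    train: List[Interval] = []
    test: List[Interval] = []
    for interval in intervals:
        (test if interval.chrom in held_out else train).append(interval)
    return train, test


examples = [
    Interval("chr1", 0, 1_048_576),
    Interval("chr8", 0, 1_048_576),
    Interval("chr9", 2_000_000, 3_048_576),
    Interval("chr19", 10_000_000, 11_048_576),
]
train, test = split_by_chromosome(examples)
print([iv.chrom for iv in train], [iv.chrom for iv in test])
# ['chr1', 'chr19'] ['chr8', 'chr9']
```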

Benchmarks: what AlphaGenome actually beats

AlphaGenome reports state-of-the-art results across roughly two dozen public benchmarks and competition-style tasks, beating Enformer, SpliceAI, and Borzoi on most of them with margins that are large enough to matter for downstream use. The headline result is unified performance: a single model outperforms task-specific models trained for that single task.

The benchmark categories most relevant to applied work are:

eQTL effect direction and magnitude (GTEx fine-mapped eQTLs): predict the sign and magnitude of expression change for fine-mapped expression quantitative trait loci. AlphaGenome reports substantially higher Pearson correlation against measured eQTL effect sizes than Enformer, with the largest gains on distal eQTLs (>50 kb from the target gene); at the far end of that range, pairs more than about 98 kb apart fall outside a gene-centered 196 kb Enformer window altogether.
Splice variant prediction (SpliceAI test set, Vex-Seq, MFASS) — AlphaGenome matches or beats SpliceAI on the SpliceAI test set and gains meaningful margin on deep-intronic variants where context matters. Crucially, splicing predictions are now consistent with the model’s expression predictions, which SpliceAI cannot do at all.
Chromatin accessibility variant prediction (caQTL benchmarks) — AlphaGenome predicts the direction of chromatin accessibility change for fine-mapped chromatin QTLs at accuracies well above prior single-modality models.
Massively parallel reporter assay (MPRA) prediction — AlphaGenome predicts the activity of synthetic regulatory sequences tested in MPRA experiments with correlations that approach the experimental noise floor on the cleanest datasets.
ClinVar non-coding pathogenicity — AlphaGenome scores stratify ClinVar pathogenic non-coding variants from likely benign at AUC values reported in the high 0.8s on curated subsets, which is a meaningful improvement over CADD and Eigen for the same task.

The single number that captures the model’s reach is the eQTL effect-direction accuracy on distal regulatory variants. Where Enformer hit roughly chance on variants outside its 196 kb window simply because it could not see them, AlphaGenome predicts effect direction at accuracies in the 0.7–0.8 range for variants up to 500 kb from the target gene. That is the kind of step change that moves regulatory-variant prediction from research toy to clinical-grade prior.
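Effect-direction accuracy is just sign concordance between predicted and measured effects. The sketch below is one plausible way to compute it, with made-up numbers; the function name and the handling of zeros (observed zeros are skipped, predicted zeros count as misses) are assumptions, and the published benchmark may define the metric differently.

```python
from typing import Sequence


def sign_accuracy(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Fraction of variants whose predicted effect has the same sign as the observed one."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if o != 0]
    if not pairs:
        raise ValueError("no variants with a nonzero observed effect")
    correct = sum(1 for p, o in pairs if p * o > 0)
    return correct / len(pairs)


# Illustrative values only: predicted deltas vs. measured eQTL effect sizes.
predicted = [-0.4, 0.2, 0.1, -0.3, 0.5]
observed = [-1.2, 0.8, -0.1, -0.6, 0.9]
print(sign_accuracy(predicted, observed))  # 0.8
```

Chance level for this metric is 0.5 when positive and negative effects are balanced, which is the baseline the 0.7–0.8 range should be read against.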

A note on calibration. AlphaGenome’s published outputs come with confidence intervals for variant effect summaries, derived from prediction variance across a small ensemble of model heads. This matters operationally because clinical pipelines want not just a point estimate but a “this prediction is reliable” signal. The model is appropriately less confident on under-represented tissues and on variants in repeat-heavy regions, which are exactly the places you would want any reasonable model to express doubt.
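As a rough illustration of turning ensemble spread into an interval, the sketch below takes per-member scores for one variant and reports the mean with a normal-approximation band. The function name, the five example scores, and the choice of mean ± 1.96 standard deviations are all assumptions made for illustration; the text above does not specify how the published intervals are constructed.

```python
import statistics
from typing import Sequence, Tuple


def ensemble_interval(member_scores: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using the spread across ensemble members."""
    mean = statistics.fmean(member_scores)
    spread = statistics.stdev(member_scores)  # sample standard deviation across members
    return mean, mean - z * spread, mean + z * spread


# Hypothetical expression z-scores for one variant from five ensemble members.
scores = [-3.0, -3.2, -2.8, -3.1, -2.9]
mean, lower, upper = ensemble_interval(scores)
print(f"{mean:.2f} [{lower:.2f}, {upper:.2f}]")

# An interval that excludes zero supports a directional call; one that spans zero does not.
print("direction supported:", upper < 0 or lower > 0)
```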

A subtler benchmark observation matters for anyone replacing an existing pipeline. The eQTL benchmarks are reported on fine-mapped variants from GTEx, which are biased toward variants with measurable effect sizes: the dataset is enriched for the easy cases. Performance on the full set of GTEx variants, including those whose causal status is uncertain, is meaningfully lower than the headline numbers suggest. The published numbers are the right ones to compare to other models on the same fine-mapped set, but they are not the numbers you should expect when scoring every common variant in a healthy individual. Accuracy on real population-genetics workloads will be lower, and that gap is expected.

The other benchmark worth understanding is the saturation mutagenesis assays. Recent MPRA studies that exhaustively test every possible single-base change in a regulatory element generate the cleanest signal-to-noise data the field has. AlphaGenome predicts these saturation maps at correlations approaching the assay’s internal replicate reproducibility, which is the right ceiling. This is where the AlphaFold-moment framing lands hardest: when a model approaches assay-replicate accuracy, the marginal value of computational prediction is genuinely equivalent to running the assay one more time, for variants the assay can be run on. The asymmetry is throughput — the assay tests one regulatory element at a time, while AlphaGenome scores anything in the genome.

Trade-offs and failure modes

AlphaGenome is a major advance, but it has specific limits practitioners need to internalize before wiring it into a pipeline. The trade-offs cluster around resolution, generalization, and what the model fundamentally cannot do.

Tissue resolution, not single-cell resolution. The supervision data is bulk-tissue or sorted-cell-type level. AlphaGenome predicts an average over a population of cells, not the heterogeneous expression you would see in a single-cell RNA-seq dataset. If your question is “does this variant disrupt expression in CD8+ effector memory T cells specifically,” AlphaGenome cannot answer at that granularity. Future versions trained on the growing single-cell ENCODE expansion will close this gap, but as of 2026 the model is tissue-resolved.

Structural variants are partial. The model handles substitutions, indels, and short structural changes that fit cleanly inside the 1 Mb window. Long-range copy-number variants, large inversions, and translocations that span beyond the window cannot be scored end-to-end. For those, the model can score predicted breakpoints but not the global topological consequence.

Out-of-distribution tissues and species. The training corpus is overwhelmingly human and mouse and is biased toward cell types that are easy to culture and sequence. Primary tissues that are underrepresented in ENCODE/GTEx — placenta, certain CNS regions, embryonic stages — get noisier predictions. Predictions for non-mammalian organisms are not supported by the trained model.

Predictions are correlative, not causal. AlphaGenome learns the statistical relationship between sequence and measured signal. It does not, on its own, prove that a variant causes a phenotype — it predicts that a variant is likely to disrupt a regulatory element that is associated with relevant tissues. The clinical curator’s job of integrating that signal with segregation, allele frequency, and functional evidence does not go away.

Computational cost is non-trivial. A full 1 Mb forward pass for both reference and alternate allele takes meaningful GPU time — on the order of a second per variant on a recent NVIDIA H100 with optimized kernels. Scoring every variant in a typical whole-genome at full resolution is a real workload. Most production pipelines pre-filter to variants in regulatory annotations and only run AlphaGenome on the survivors. For the full-genome use case, the throughput limitation is real.
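The scale of the workload follows from simple arithmetic. The sketch below assumes about one second per variant (the order of magnitude quoted above), 4.5 million variants per genome (the midpoint of the 4–5 million figure cited earlier), and a pre-filter that keeps roughly 1 in 100 variants; all three are round numbers for illustration, not measurements.

```python
def gpu_hours(n_variants: int, seconds_per_variant: float = 1.0) -> float:
    """Single-GPU hours to score n_variants (each score = REF pass + ALT pass)."""
    return n_variants * seconds_per_variant / 3600


VARIANTS_PER_GENOME = 4_500_000   # assumed midpoint of 4-5 million
PREFILTER_KEEP = 100              # assumed: keep ~1 in 100 variants

full_genome = gpu_hours(VARIANTS_PER_GENOME)
prefiltered = gpu_hours(VARIANTS_PER_GENOME // PREFILTER_KEEP)

print(f"every variant: {full_genome:,.0f} GPU-hours (~{full_genome / 24:.0f} GPU-days)")
print(f"pre-filtered:  {prefiltered:.1f} GPU-hours")
# every variant: 1,250 GPU-hours (~52 GPU-days)
# pre-filtered:  12.5 GPU-hours
```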

Doesn’t replace functional validation. This is the most important caveat. A high AlphaGenome score is a strong prior; it is not proof. CRISPRi screens, MPRA, allele-specific expression assays, and reporter constructs remain the validation tier. The shift the model creates is in prioritization: which variants are worth the wet-lab time, not which variants you can skip the wet-lab on.

The failure-mode state machine above maps how a production variant-interpretation pipeline should treat AlphaGenome outputs: high-confidence high-effect predictions go to one tier, low-confidence or out-of-distribution predictions get flagged for expert review or default to functional validation rather than algorithmic certainty.
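The routing logic described here can be written down as a small pure function. This is a sketch under stated assumptions: the tier names, the thresholds (|z| ≥ 3, interval width ≤ 1), and the `in_distribution` flag are all hypothetical placeholders that a lab would have to set and validate for itself.

```python
from enum import Enum


class Tier(Enum):
    PRIORITIZE = "prioritize for functional follow-up"
    EXPERT_REVIEW = "flag for expert review or functional validation"
    LOW_PRIORITY = "low priority; no automated classification"


def triage(zscore: float, ci_width: float, in_distribution: bool,
           effect_threshold: float = 3.0, max_ci_width: float = 1.0) -> Tier:
    """Route one scored variant; uncertainty is checked before effect size."""
    if not in_distribution or ci_width > max_ci_width:
        return Tier.EXPERT_REVIEW      # low confidence or out-of-distribution tissue
    if abs(zscore) >= effect_threshold:
        return Tier.PRIORITIZE         # high confidence, high effect
    return Tier.LOW_PRIORITY           # high confidence, small effect


print(triage(-3.1, ci_width=0.4, in_distribution=True).name)   # PRIORITIZE
print(triage(-3.1, ci_width=2.0, in_distribution=True).name)   # EXPERT_REVIEW
print(triage(-0.2, ci_width=0.3, in_distribution=True).name)   # LOW_PRIORITY
```

Checking confidence first is deliberate: a large score from an unreliable prediction should never reach the high-effect tier on magnitude alone. No tier auto-classifies a variant, consistent with treating the score as a prior.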

A related and often-underdiscussed failure mode is reference-genome bias. The training corpus is overwhelmingly GRCh38 reference and a handful of cell lines derived from individuals of European ancestry. Variants that are common in non-European populations but rare or absent in reference panels get scored against a regulatory grammar the model has learned from a narrower slice of human diversity. This is not a quirk of AlphaGenome — it is a systemic property of the underlying ENCODE/GTEx data — but it is real, and a clinical pipeline serving a diverse population should treat scores for ancestry-stratified rare variants with appropriate caution. The fix in the long run is broader functional-genomics coverage; the fix in the short run is to log ancestry context alongside the score and not over-claim certainty.

A second under-discussed limit is dynamic-range compression at the extremes. The model is trained to predict measured signal coverage, which has a noise floor at the low end and a saturation ceiling at the high end. Predictions near these limits are systematically biased toward the median. A variant that ablates an already-silent enhancer will produce a near-zero delta not because the model thinks the variant is benign, but because there was no signal to lose. The same goes for variants in already-maximally-active promoters. Practitioners need to check that the reference forward pass produces a meaningful signal in the relevant tracks before trusting a small delta as evidence of no effect.
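The check described above can be made explicit in code. The function and its thresholds are hypothetical: the noise floor and saturation level would come from each track's own signal distribution, and `min_effect` is whatever delta the pipeline treats as meaningful.

```python
def interpret_delta(ref_signal: float, alt_signal: float, *,
                    noise_floor: float, saturation: float, min_effect: float) -> str:
    """Label a REF/ALT pair, separating 'no effect' from 'no signal to measure'."""
    delta = alt_signal - ref_signal
    if abs(delta) >= min_effect:
        return "effect"
    if ref_signal <= noise_floor:
        return "uninformative: reference at noise floor"
    if ref_signal >= saturation:
        return "uninformative: reference near saturation"
    return "no detectable effect"


limits = dict(noise_floor=0.1, saturation=9.0, min_effect=0.5)  # made-up track limits

print(interpret_delta(0.05, 0.04, **limits))  # uninformative: reference at noise floor
print(interpret_delta(4.00, 3.95, **limits))  # no detectable effect
print(interpret_delta(4.00, 2.00, **limits))  # effect
```

Only the middle case is evidence of no effect in that track; the first says nothing about the variant either way.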

Practical recommendations for using AlphaGenome

For a clinical or research lab building a non-coding variant interpretation pipeline in 2026, a short checklist:

Use AlphaGenome as a prior, not a verdict. Score variants and use the output to rank candidates for functional follow-up. Do not auto-classify based on score alone.
Pre-filter before scoring. Restrict input to variants in annotated regulatory elements (ENCODE cCREs, ENCODE-rE2G), splice regions, or within fine-mapped credible sets. This cuts compute by ~100× without losing signal.
**Always score in the relevant tissue con
