DNA Language Models: Genomic Foundation Models Explained

The genome is often called the blueprint of life, but a more precise analogy is that it resembles a very long, very noisy document written in a four-letter alphabet. DNA language models — also called genomic foundation models or genome language models — apply the same self-supervised learning logic that made large language models work for text, but they operate on nucleotide sequences instead of words. The core idea sounds almost trivially simple: train a neural network to predict masked or missing bases in a genome using billions of nucleotides as training data, and see whether the learned representations capture something biologically real. The past three years have seen an explosion of activity in this space, from DNABERT and Nucleotide Transformer to HyenaDNA, Caduceus, and the Arc Institute’s Evo and Evo 2. The thesis of this article is that the real breakthrough is not “BERT for DNA.” It is the arrival of long-context architectures — state-space models and Hyena-family operators — that finally let a model perceive regulatory relationships at genuine genome scale. Biological validation, not model perplexity, is the test that matters.

What this covers: how the self-supervised training pipeline works, the tokenization choices and why they matter, a tour of the major model lineages, the long-context problem and how state-space models address it, downstream tasks from variant effect prediction to sequence generation, and an honest account of what these models still cannot do.

The Genome as Language: Why This Framing Works — and Where It Breaks Down

The analogy between a genome and a natural language document is productive, but it requires care. In a text corpus, meaning emerges from word co-occurrence statistics across millions of documents. In a genome, meaning is physically implemented: a transcription factor binding site hundreds of kilobases away from a gene’s promoter can still control that gene’s expression through three-dimensional chromatin looping. The “vocabulary” of DNA is just four nucleotides — A, T, G, C — but the combinatorial regulatory grammar encoded in their arrangement is extraordinarily complex.

What makes the language-modeling framing productive is that the self-supervised pretraining signal works without any labeled data. A model trained to predict masked nucleotides in a genome must, in order to minimize prediction error, implicitly learn what bases are likely given their context. If the model learns that certain hexamer sequences near transcription start sites are conserved across vertebrates, that is because those sequences have functional significance — the model has extracted a biological signal without anyone telling it which positions were regulatory. This is the same logic that makes word embeddings capture semantic relationships: distributional statistics over large corpora encode structure that was never explicitly labeled.

The framing breaks down in a few important ways. Genomes are not linear narratives; they have three-dimensional structure, and the sequence-based model cannot see that structure. They also evolve, recombine, and contain massive amounts of non-functional or structurally constrained sequence that has nothing to do with the kind of semantic content that makes language modeling so powerful for text. These are not fatal objections, but they are reasons to be cautious about treating benchmark performance on in-silico tasks as a proxy for genuine biological understanding.

Why now? Three converging factors explain the timing. First, genome sequencing costs have continued to fall, making multi-species training corpora comprising trillions of nucleotides feasible to assemble. Second, the transformer architecture and its descendants proved scalable in ways that older convolutional and recurrent approaches did not. Third, and most recently, novel architecture families — especially state-space models and Hyena-family operators — removed the quadratic attention bottleneck that had capped the useful context length of transformer-based DNA models at lengths far shorter than the regulatory distances that matter in biology.

How DNA Language Models Work: Self-Supervision, Tokenization, and Architecture

The training pipeline for a genomic foundation model follows a recognizable structure, even if the biological domain introduces complications at every step.

Figure 1: The end-to-end pretraining pipeline — raw nucleotide sequence enters a tokenizer, passes through the model architecture’s layers, and is trained on a self-supervised objective to produce transferable sequence embeddings.

Tokenization: The First Consequential Design Choice

Before a single gradient update occurs, the designer must decide how to convert a raw nucleotide string into discrete tokens. This choice has downstream consequences for everything from vocabulary size to the model’s ability to capture overlapping regulatory motifs.

The three main strategies are k-mer tokenization, single-nucleotide tokenization, and byte-pair encoding (BPE).

K-mer tokenization treats every overlapping or non-overlapping window of k consecutive nucleotides as a single token. DNABERT used 3-mers, 4-mers, 5-mers, and 6-mers in separate model variants. A 6-mer vocabulary covers all 4^6 = 4,096 possible hexamers, which is roughly the length of the core of many transcription factor binding motifs. Both variants have a cost. Overlapping k-mers are highly redundant: adjacent tokens share k-1 bases, so a masked token can largely be reconstructed from its immediate neighbors. Non-overlapping k-mers are frame-dependent: shifting the sequence by a single base produces an almost entirely different token sequence, and alignment information at the resolution of individual bases is lost.
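The overlapping and non-overlapping variants can be compared directly in a few lines. The following is a minimal sketch, not any model's actual tokenizer: the function name `kmer_tokens` and the example sequence are made up for illustration.

```python
def kmer_tokens(seq: str, k: int, stride: int) -> list[str]:
    """Split seq into k-mers. stride=1 overlaps; stride=k does not."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ATGCGTACGTTAGC"   # hypothetical 14-base example sequence
shifted = seq[1:]        # the same sequence with its first base dropped

overlapping = kmer_tokens(seq, k=6, stride=1)
non_overlapping = kmer_tokens(seq, k=6, stride=6)
non_overlapping_shifted = kmer_tokens(shifted, k=6, stride=6)

# Overlapping tokens: each one repeats 5 of the 6 bases of its neighbor.
print(len(overlapping), overlapping[:3])

# Non-overlapping tokens: a one-base shift changes every token.
print(non_overlapping)
print(non_overlapping_shifted)

# Vocabulary size for hexamers, ignoring special tokens.
print(4 ** 6)
```

The shifted and unshifted non-overlapping tokenizations share no tokens at all, which is the frame sensitivity that single-nucleotide and BPE tokenizers are meant to avoid.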

Single-nucleotide tokenization avoids this by treating each base as its own token, producing vocabulary sizes of just 4–8 symbols (the four bases plus ambiguity codes). This is the approach used by HyenaDNA and by the Evo family. It is maximally information-preserving at the cost of sequence length: a 1-megabase (1 Mb) genomic region becomes a sequence of one million tokens, which would be completely infeasible for a standard transformer but is tractable for a linear-scaling architecture.

Byte-pair encoding, familiar from GPT-style text models, learns a vocabulary of variable-length substrings by iteratively merging frequent pairs. DNABERT-2 adopted this approach, arguing that it produces a vocabulary that better reflects the actual statistical structure of genomic sequence and allows the model to process longer spans per token without the regularity artifacts of fixed-length k-mers. The tradeoff is that the resulting tokens are not interpretable in terms of known biological motifs.
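The merge procedure is easy to see on a toy sequence. This is an illustrative sketch of the generic BPE idea only; it is not DNABERT-2's tokenizer, and the function name `learn_bpe_merges` and the input string are assumptions made for the example.

```python
from collections import Counter

def learn_bpe_merges(seq: str, num_merges: int):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(seq)          # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:           # nothing repeats, so stop merging
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = learn_bpe_merges("TATATATACG", num_merges=5)
print(tokens)   # variable-length tokens covering the original sequence
print(merges)   # learned merge rules, in order
```

Frequent substrings collapse into single tokens while rare ones stay at single-base resolution, which is why a BPE-tokenized sequence is shorter than its single-nucleotide form but has tokens with no fixed biological meaning.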

The Self-Supervised Objective

Two objectives dominate the field, corresponding to the encoder and decoder paradigms in NLP. Masked language modeling (MLM), used by BERT-style models including DNABERT, DNABERT-2, and Nucleotide Transformer, randomly masks a fraction of tokens and trains the model to predict the masked values from context in both directions. This produces rich bidirectional representations well suited to classification tasks. Autoregressive next-token prediction, used by HyenaDNA and the Evo family, trains the model to predict each token from its left context only. This is less natural for DNA, where regulatory signals act from both sides of a position and either strand can be read, but it allows the model to generate novel sequences, which MLM cannot do without additional machinery.
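The difference between the two objectives comes down to how training examples are built from the same token sequence. The sketch below assumes a 15% mask fraction (the conventional BERT default, not a figure from any specific DNA model) and omits the random-replacement refinements real MLM pipelines use; all names are hypothetical.

```python
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of tokens; the model must recover them from both sides."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # label is the original token
        inputs[pos] = MASK           # model sees the mask instead
    return inputs, targets

def make_autoregressive_examples(tokens):
    """Each example pairs a left context with the single next token to predict."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = list("ACGTTGCAACGTAGCTAGGA")   # 20 single-nucleotide tokens, made up
mlm_inputs, mlm_targets = make_mlm_example(tokens)
ar_examples = make_autoregressive_examples(tokens)

print("".join(t if t != MASK else "_" for t in mlm_inputs), mlm_targets)
print(ar_examples[:3])
```

The MLM example yields a handful of labeled positions with full two-sided context; the autoregressive setup yields a label at almost every position but only ever conditions on what lies to the left.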

The Architecture

For most of the period from 2021 to 2023, the standard choice was a BERT-style transformer encoder, typically trained from scratch on genomic sequence. The limitation was and remains the O(n²) compute and memory cost of self-attention, which makes processing sequences longer than roughly 10,000–20,000 tokens prohibitively expensive. For reference, the human genome is approximately 3.1 billion base pairs; even a single gene locus with its regulatory environment can span hundreds of kilobases.

This is why state-space models — discussed in depth below — represent a qualitative shift rather than an incremental improvement.

The Model Landscape: Evo 2, Nucleotide Transformer, DNABERT-2, HyenaDNA, and Caduceus

The field has produced a recognizable set of model lineages, each making distinct architectural and training-data choices.

Figure 2: The breadth of downstream applications — a pretrained DNA language model funnels into fine-tuning or zero-shot probing for variant effect prediction, regulatory element detection, expression prediction, novel sequence generation, splice site annotation, and chromatin accessibility modeling.

DNABERT and DNABERT-2

DNABERT, published by Ji et al. in 2021, was the proof-of-concept that a transformer pre-trained on human genome sequence using k-mer MLM could produce embeddings useful for promoter detection, splice site prediction, and transcription factor binding. It established the template. DNABERT-2, published in 2023, replaced k-mer tokenization with BPE, adopted the ALiBi positional encoding to extrapolate beyond trained lengths, and trained on a multi-species corpus, significantly improving performance on species-agnostic tasks. Both models are fundamentally limited to short contexts by their transformer architecture.

Nucleotide Transformer

The Nucleotide Transformer (NT), from InstaDeep and collaborators, scaled the basic masked-nucleotide-prediction approach to models ranging from 50M to 2.5B parameters, trained on the human reference genome, 3,202 diverse human genomes from the 1000 Genomes Project, and a multispecies collection of 850 genomes. Its key contribution was systematic evaluation: NT was benchmarked on a suite of 18 downstream tasks spanning histone modification, chromatin accessibility, and splice sites, establishing a public reference for comparing subsequent models. The 2025 Nature Communications benchmark study, covering DNABERT-2, Nucleotide Transformer V2, HyenaDNA, Caduceus-Ph, and GROVER, found that while general-purpose DNA foundation models were competitive on pathogenic variant identification, they underperformed specialized models on gene expression prediction and quantitative trait locus (QTL) association tasks.

HyenaDNA

HyenaDNA, from Nguyen et al. (2023), was the first major genomic foundation model to replace transformer attention with a Hyena operator — a long convolution governed by implicit parameterization that scales near-linearly in sequence length. Trained at single-nucleotide resolution, HyenaDNA demonstrated the ability to process contexts up to 1 million tokens at training time and showed that longer contexts improved performance on tasks requiring distal regulatory information. The model is architecturally significant because it established that sub-quadratic operators were not merely a computational convenience but a functional necessity for genome-scale reasoning.

Caduceus and Mamba-Based Approaches

Caduceus, from Schiff et al. (2024), applies the Mamba state-space model, a selective-scan SSM developed by Gu and Dao, to DNA sequences. Caduceus introduces a bidirectional variant of Mamba (BiMamba) to address the unidirectionality of standard SSM recurrence, and a “reverse complement equivariance” design that respects the biological fact that a DNA strand and its reverse complement are two representations of the same information. It comes in two variants: Caduceus-PS builds the equivariance into the architecture through parameter sharing, while Caduceus-Ph is trained with reverse-complement data augmentation and combines predictions for a sequence and its reverse complement post hoc at inference time. In benchmarks, Caduceus is competitive with Nucleotide Transformer on short-context tasks while handling substantially longer sequences at lower memory cost.
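Reverse-complement equivariance can be stated as a testable property: running the model on the reverse complement should give the original output read in reverse order. The sketch below checks that property for two toy per-position "models" that emit one number per base. It illustrates the property only; it is not Caduceus code, it ignores the channel-flipping that applies to multi-channel hidden states, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_indicator(seq: str) -> list[float]:
    """Toy model: 1.0 where the base is G or C. Strand-symmetric by construction."""
    return [1.0 if base in "GC" else 0.0 for base in seq]

def purine_indicator(seq: str) -> list[float]:
    """Toy model: 1.0 where the base is A or G. Depends on which strand is read."""
    return [1.0 if base in "AG" else 0.0 for base in seq]

def is_rc_equivariant(model, seq: str) -> bool:
    """True if model(RC(x)) equals model(x) reversed along the sequence."""
    return model(reverse_complement(seq)) == model(seq)[::-1]

seq = "AAGTCCG"   # made-up example
print(reverse_complement(seq))
print(is_rc_equivariant(gc_indicator, seq))
print(is_rc_equivariant(purine_indicator, seq))
```

A model without the property can give different answers for the same locus depending on which strand happened to be in the reference file, which is the failure mode an equivariant design rules out.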

Evo and Evo 2: The Current Frontier

The original Evo model from the Arc Institute applied the StripedHyena architecture, a hybrid that interleaves Hyena long-convolution layers with multi-head attention layers at regular intervals, at single-nucleotide resolution to a prokaryotic and phage genome corpus. It demonstrated that a single model could be used for multi-level biological tasks: predicting the essentiality of genes, generating CRISPR-Cas protein-RNA complexes, and designing protein-coding sequences, all from the same pretrained backbone.

Evo 2, published as a preprint in February 2025 and subsequently published in Nature in March 2026, scales this substantially. Evo 2 is a 40-billion-parameter model trained on over 9 trillion nucleotides drawn from the genomes of more than 100,000 species spanning the entire tree of life: bacteria, archaea, fungi, plants, and animals, including human. Its context window is 1 megabase (1,000,000 nucleotides), processed at single-nucleotide resolution. The model uses a refined StripedHyena 2 architecture that achieves near-linear scaling of both compute and memory with respect to context length. In terms of capability demonstrations, Evo 2 was shown to perform zero-shot variant effect prediction on clinically relevant human variants, generate functional non-coding sequences, and predict the effects of large-scale structural variants, tasks that require genomic context far beyond what short-context transformer models can see. The Arc Institute has released model weights and a HuggingFace-compatible interface, making Evo 2 the current reference point for the field.

The Long-Context Problem and Why State-Space Models Change the Equation

The single most important technical insight in genomic foundation models over the past three years is that context length is not a hyperparameter to tune — it is a biological constraint that the architecture must meet or the model will systematically miss the regulatory logic of the genome.

Figure 3: Standard transformer attention cost scales quadratically with sequence length, capping practical utility at roughly 10,000–20,000 tokens; state-space and Hyena-family architectures scale near-linearly, making 1-megabase single-nucleotide processing tractable and enabling the model to perceive distal regulatory relationships that are invisible to short-context models.

Consider what it means for a model to have a 512-token context window when processing human genomic sequence at the single-nucleotide level. That model can see 512 base pairs at a time: a small fraction of a typical human gene, which spans tens of kilobases once introns are included, and a vanishingly small fraction of the regulatory landscape. Enhancers routinely act on genes tens to hundreds of kilobases away through chromatin looping, and in some cases more than a megabase away. Topologically associating domains (TADs), which organize the genome into locally interacting regulatory neighborhoods, span hundreds of kilobases to a few megabases. A model that cannot see across these distances cannot learn the regulatory grammar that determines when and where genes are expressed.

Standard transformer attention has a memory cost that scales as O(n²) in sequence length. Doubling the context quadruples the memory requirement. This is why DNABERT and the early Nucleotide Transformer variants were limited to sequences of 512 to 4,096 tokens. Tricks like sparse attention and sliding window attention can extend this somewhat, but they break the all-to-all connectivity that makes attention powerful.
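The arithmetic behind that limit is worth working through once. The sketch below computes the memory needed to hold a single dense n-by-n attention score matrix and compares it with a per-token state that grows linearly. The 2-byte values and the width of 4,096 are illustrative assumptions, not the configuration of any model discussed here. Fused kernels such as FlashAttention avoid materializing the full matrix, but the quadratic compute remains.

```python
def attention_matrix_gib(n_tokens: int, bytes_per_value: int = 2) -> float:
    """Memory for one dense n x n attention score matrix (one head, one layer)."""
    return n_tokens ** 2 * bytes_per_value / 2 ** 30

def linear_state_gib(n_tokens: int, width: int = 4096, bytes_per_value: int = 2) -> float:
    """Memory for one activation vector of the given width per token."""
    return n_tokens * width * bytes_per_value / 2 ** 30

for n in (512, 4_096, 16_384, 131_072, 1_000_000):
    print(f"{n:>9} tokens: "
          f"attention matrix {attention_matrix_gib(n):12.4f} GiB, "
          f"linear state {linear_state_gib(n):8.4f} GiB")

# Doubling the context quadruples the quadratic term and doubles the linear one.
print(attention_matrix_gib(8_192) / attention_matrix_gib(4_096))
print(linear_state_gib(8_192) / linear_state_gib(4_096))
```

At one million tokens the single score matrix alone runs to well over a terabyte under these assumptions, before multiplying by heads and layers, while the linear term stays in the single-digit gigabyte range.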

State-space models (SSMs) represent a structurally different approach. An SSM maps an input sequence to an output sequence through a hidden state that is updated recurrently, like an RNN. When the state-update parameters are the same at every position, the same computation can be rewritten as a long convolution and computed in parallel during training. The Mamba architecture (Gu and Dao, 2023) introduced selective state-space mechanics, in which the update parameters depend on the input, allowing the model to choose which parts of the input to retain in the hidden state and which to discard. This gives up the convolutional form in favor of a hardware-aware parallel scan, but it provides content-aware long-range memory without quadratic cost. Hyena operators achieve similar scaling through implicit long convolutions parameterized by small neural networks, avoiding the n-by-n attention matrix entirely.
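The equivalence between the recurrent and convolutional views can be checked numerically for the simplest case: a scalar state with fixed parameters, h_t = a·h_{t-1} + b·x_t and y_t = c·h_t. This is a minimal sketch of a time-invariant SSM, not Mamba or Hyena; in a selective SSM the parameters vary with the input, so the fixed kernel below no longer exists. The parameter values are arbitrary.

```python
def ssm_recurrent(x, a, b, c):
    """Step through the sequence, carrying one hidden state value."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys

def ssm_convolutional(x, a, b, c):
    """Same map as a causal convolution with kernel K_j = c * a**j * b."""
    kernel = [c * (a ** j) * b for j in range(len(x))]
    return [sum(kernel[j] * x[t - j] for j in range(t + 1))
            for t in range(len(x))]

x = [1.0, 0.0, 2.0, -1.0, 0.5]     # arbitrary input sequence
a, b, c = 0.9, 0.5, 2.0            # arbitrary fixed parameters

y_rec = ssm_recurrent(x, a, b, c)
y_conv = ssm_convolutional(x, a, b, c)
print(y_rec)
print(y_conv)
```

The recurrent form is what makes inference cheap, with constant work per new token; the convolutional form is what makes training parallel. The naive convolution here is quadratic, and practical implementations evaluate it with FFTs instead.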

The practical consequence for genomics is not subtle. HyenaDNA’s experiments demonstrated that models trained at longer context lengths systematically outperformed the same model trained at shorter context on regulatory element classification tasks — the benefit was not marginal. Evo 2’s 1-megabase context, enabled entirely by the StripedHyena 2 architecture, is what allows it to reason about the relationship between a variant in a non-coding regulatory region and the gene it controls hundreds of kilobases away.

It is worth being precise about what “near-linear scaling” means in practice. For both Hyena-family operators and Mamba-family SSMs, the memory cost scales linearly with sequence length rather than quadratically. Total compute scales as O(n) for scan-based SSMs and as O(n log n) for FFT-based long convolutions, so the cost per token is constant or grows only logarithmically. For smaller models this makes processing a 1-megabase sequence feasible on a single high-memory GPU, something that would be impractical with standard dense attention; the largest models still require multiple GPUs.

The architectural family does carry its own tradeoffs, discussed below, but the fundamental point stands: for genomics specifically, long-context is not a nice-to-have feature. It is the prerequisite for modeling the regulatory logic that the genome actually uses.

Trade-Offs and What Goes Wrong: Validation, Interpretability, and the Hype Problem

Genomic foundation models are a genuine advance. They are also a field with significant methodological weaknesses, a tendency to overstate benchmark performance, and several fundamental unsolved problems. Being honest about these is not a counsel of pessimism — it is the prerequisite for using these tools usefully.

The Validation Problem

The most serious issue in this field is the gap between in-silico benchmark performance and experimental validation. A model that achieves high accuracy predicting known transcription factor binding sites in held-out genome regions is demonstrating pattern recognition on data that was itself produced by ChIP-seq experiments with their own noise floor and systematic biases. When the same model is asked to predict the functional consequence of a novel variant in a regulatory region where no ChIP-seq data exists, the validation chain becomes indirect and fragile.

The 2025 Nature Communications benchmark (covering DNABERT-2, NT V2, HyenaDNA, Caduceus-Ph, and GROVER) found that general-purpose DNA foundation models underperformed specialized models on gene expression prediction and QTL identification. This is an important result: it means that training on diverse genomic sequence — while producing broadly useful representations — does not automatically confer strong performance on quantitative molecular biology tasks. Fine-tuning helps, but fine-tuning on small labeled datasets reintroduces the label-efficiency problem that foundation models were supposed to solve.

Variant effect prediction is a particularly high-stakes area. Models like Evo 2 demonstrate compelling zero-shot performance on clinically annotated variants from ClinVar, using log-likelihood scoring to assign pathogenicity predictions. But ClinVar itself has biases toward variants in coding regions and toward variants associated with well-studied diseases. The model’s performance on rare, non-coding, or structurally complex variants — precisely the hardest and most clinically important cases — is much harder to evaluate rigorously because ground truth is sparse.

Experimental validation of model-generated sequences is even more demanding. The Evo 2 paper included wet-lab validation of select designed sequences, which is the right standard. But the fraction of in-silico predictions validated experimentally in any published paper is necessarily small, and the publication bias toward successful validations is substantial. Researchers evaluating these models for therapeutic applications should treat published validation success rates as optimistic upper bounds, and should expect rigorous validation for their specific use case to be harder than the literature suggests.

The Interpretability Problem

Understanding why a DNA language model makes a prediction is harder than understanding why a text language model makes a prediction, and it was already hard in the text domain. Attribution methods — saliency maps, gradient-based attribution, attention weight visualization — can identify which input positions most influenced a prediction, but they cannot explain the biological mechanism. A model that correctly identifies a variant as pathogenic because it disrupts an obscure regulatory element cannot articulate that reasoning in a way that reveals the element.

Goodfire’s interpretability research on Evo 2 (published in 2025) applied activation-based interpretability methods to probe what the model encodes, finding that certain internal features correspond to recognizable biological concepts. This is encouraging preliminary work, but it is far from the mechanistic transparency that clinical or regulatory applications would require.

The architecture differences between transformer-based and SSM-based models also create different interpretability challenges. Attention weights, for all their limitations, provide at least a visual representation of which positions attend to which. SSM hidden states are dense continuous vectors with no direct analog to attention patterns; the recurrence structure makes it less obvious how to extract position-specific attributions. This is an active research area.

The Hype Problem

The phrase “ChatGPT for the genome” has appeared in a distressing number of press releases. The analogy is misleading in ways that matter. Large language models for text can be evaluated on clear, human-graded tasks — does the translation make sense, is the code correct? Genomic foundation models are evaluated primarily on tasks where ground truth is either computationally derived (other model predictions used as labels) or experimentally sparse (the number of variants with gold-standard functional data is small relative to the space of possible variants). This makes benchmark inflation harder to detect and benchmark numbers harder to interpret.

The field also has a model-scale arms race dynamic, with each new release advertising parameter counts and training data volumes. Scale matters, but the 2025 benchmark study found that scale alone — within the range studied — did not reliably predict downstream task performance. Task-specific fine-tuning, the quality and relevance of pretraining data for the task at hand, and the architectural fit to the sequence length regime of the task were all more important predictors of performance than raw model size.

The honest position is that DNA language models are powerful sequence representation tools that have already accelerated research in regulatory genomics and variant interpretation. They are not oracles of biological function, and they should not be used as such without experimental follow-up.

Practical Recommendations for Researchers and Engineers Evaluating DNA Language Models

Whether you are a computational biologist considering these models for a research pipeline or a machine learning engineer building a genomics product, the following questions and criteria should guide evaluation.

The most consequential choice is matching the model to the sequence length regime your task requires. If your task involves short regulatory elements — promoters, TATA boxes, splice sites — and your evaluation sequences are a few hundred to a few thousand bases long, transformer-based models like DNABERT-2 or Nucleotide Transformer are well-characterized and benchmarked. If your task requires reasoning about distal regulatory relationships — enhancer-promoter links, structural variant effects, whole-gene regulatory environments — you need a model with a context window that can see the relevant distances, which means HyenaDNA, Caduceus, or Evo 2. Running a 512-token model on a long-range regulatory task and attributing poor performance to “the limits of AI” is a methodological error.

The second most consequential choice is evaluation. Do not use model perplexity or held-out reconstruction loss as your primary evaluation metric for biological utility. These measure how well the model has learned the statistics of its training corpus, not how well it captures functional biology. Benchmark your model on tasks where you have experimentally validated ground truth, even if that means a smaller and more challenging test set. The 2025 Nature Communications benchmark is a reasonable starting point for comparison, but treat it as a starting point rather than a verdict: your task may differ from the benchmark tasks in ways that significantly change which model performs best.

For variant effect prediction specifically, compare model predictions against functional assays (MPRA data, saturation genome editing results, clinical outcome data) rather than against other model predictions. This is slower and more expensive, but it is the only way to know whether the model is capturing biology rather than database biases.
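One common way to make that comparison is a rank correlation between model scores and measured effects, since the two are on different scales. The sketch below implements Spearman correlation with the standard library only. The scores and assay values are invented placeholders for illustration, not data from any model or experiment.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Placeholder values: model log-likelihood differences vs. measured assay effects.
model_scores = [-4.1, -0.3, -2.2, -0.1, -3.0]
assay_effects = [-1.8, 0.1, -0.9, 0.0, -0.4]
print(spearman(model_scores, assay_effects))
```

Reporting the same statistic for a conservation-only baseline on the same variants shows whether the model adds anything beyond what evolutionary constraint already explains.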

The following checklist summarizes the minimum diligence for responsible deployment:

Confirm the model’s pretraining species matches or encompasses your organism of interest — a model trained only on prokaryotic genomes will not transfer to human regulatory sequence.
Verify context length sufficiency for your task’s relevant regulatory distances before benchmarking.
Establish a baseline using conservation-based methods (phyloP scores, GERP) — genomic foundation models should exceed this baseline to justify their complexity.
Test on held-out chromosomes or species, not just held-out genomic positions, to avoid inflated performance from sequence similarity between train and test sets.
For any prediction you plan to act on experimentally, validate at least a representative sample in a wet-lab assay.
Track model versioning — the field moves quickly and “Nucleotide Transformer” without a version number may refer to substantially different models with different performance characteristics.
Engage with the model’s reported limitations in the original paper, not just its reported successes.
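The held-out-chromosome item in the checklist is simple to implement and easy to get wrong by splitting at random instead. A minimal sketch, assuming each example is a dictionary with a `chrom` field; the record layout and the choice of held-out chromosomes are hypothetical.

```python
def split_by_chromosome(records, held_out):
    """Put every record from a held-out chromosome in the test set, all others in train."""
    held_out = set(held_out)
    train = [r for r in records if r["chrom"] not in held_out]
    test = [r for r in records if r["chrom"] in held_out]
    return train, test

# Hypothetical labeled regions; real datasets would have thousands per chromosome.
records = [
    {"chrom": "chr1", "start": 1_000, "label": 1},
    {"chrom": "chr1", "start": 1_200, "label": 0},
    {"chrom": "chr8", "start": 5_000, "label": 1},
    {"chrom": "chr9", "start": 7_500, "label": 0},
    {"chrom": "chr2", "start": 3_300, "label": 1},
]

train, test = split_by_chromosome(records, held_out={"chr8", "chr9"})
print(len(train), len(test))
```

A random split would place the two overlapping chr1 regions above on opposite sides of the train/test boundary, which is the leakage the chromosome-level split prevents.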

Frequently Asked Questions

What is the difference between a DNA language model and a protein language model like ESM?
Both apply self-supervised pretraining to biological sequences, but the domains differ significantly. Protein language models like ESM-2 operate on amino acid sequences (20-letter alphabet) and are trained on the relatively compact space of known protein sequences: tens of millions of sequences, each typically hundreds to a few thousand residues. DNA language models operate on a four-letter nucleotide alphabet over sequences that are orders of magnitude longer, with training corpora spanning entire genomes. The downstream tasks also differ: protein models are primarily used for structure prediction and functional annotation; DNA models target regulatory logic, variant effects, and gene expression. Where the two overlap (predicting the effect of a coding variant on protein function), the question is usually handled by dedicated variant effect predictors: AlphaMissense works from protein sequence and structural context, while AlphaGenome predicts the regulatory consequences of variants from DNA sequence.

Can Evo 2 design new functional DNA sequences from scratch?
Yes, in a conditional generation sense. Evo 2 was trained with an autoregressive next-token-prediction objective, which means it can generate novel sequences by sampling from its learned conditional distributions. The Arc Institute demonstrated generation of putatively functional non-coding sequences and protein-coding sequences in the Evo 2 paper. What “functional” means here is defined by in-silico scoring — the generated sequences score well on held-out functional prediction tasks. Wet-lab validation of generated sequences was performed for selected examples, showing that some model-designed sequences had measurable biological activity. The caveat is that validated generation examples represent a small fraction of what the model can produce, and generation quality for complex regulatory architectures is still an open research question.

How does zero-shot variant effect prediction work in DNA language models?
The standard approach uses the model’s log-likelihood scores as a proxy for fitness. For a given genomic position, you compare the log-probability the model assigns to the reference allele versus the alternate allele, in context. Variants for which the model assigns much lower probability to the alternate allele than the reference are predicted to be deleterious — the model has “seen” the reference allele pattern many times across the training corpus and learned it as the statistically expected sequence. This approach does not require any fine-tuning or labeled variant data. Its limitation is that log-likelihood reflects sequence frequency in the training corpus, which correlates with but is not identical to functional constraint.
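The scoring rule itself is a single subtraction once a model can assign a log-likelihood to a sequence. In the sketch below, a small order-2 Markov chain with add-one smoothing stands in for the language model so the example runs without any weights. The class `ToyMarkovLM`, the training string, and the variant are all invented for illustration and say nothing about how Evo 2 or any other real model is called.

```python
import math
from collections import Counter

class ToyMarkovLM:
    """Stand-in for a DNA language model: order-k Markov chain, add-one smoothing."""

    def __init__(self, corpus: str, k: int = 2):
        self.k = k
        self.context_counts = Counter()
        self.next_counts = Counter()
        for i in range(k, len(corpus)):
            context, nxt = corpus[i - k:i], corpus[i]
            self.context_counts[context] += 1
            self.next_counts[(context, nxt)] += 1

    def log_likelihood(self, seq: str) -> float:
        total = 0.0
        for i in range(self.k, len(seq)):
            context, nxt = seq[i - self.k:i], seq[i]
            # Add-one smoothing over the four bases keeps unseen events finite.
            p = (self.next_counts[(context, nxt)] + 1) / (self.context_counts[context] + 4)
            total += math.log(p)
        return total

def variant_score(model, ref_seq: str, pos: int, alt: str) -> float:
    """log P(alt sequence) - log P(ref sequence); negative means alt is less expected."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return model.log_likelihood(alt_seq) - model.log_likelihood(ref_seq)

model = ToyMarkovLM("TATAAA" * 50)      # made-up repetitive training corpus
ref = "TATAAATATAAA"                    # made-up reference window
print(variant_score(model, ref, pos=2, alt="G"))   # disrupts the learned pattern
print(variant_score(model, ref, pos=2, alt="T"))   # alt equals ref
```

The toy model penalizes the G substitution only because G never followed that context in its corpus, which makes the limitation noted above concrete: the score measures surprise relative to the training data, and surprise is a proxy for functional constraint rather than a measurement of it.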

What is the relationship between DNA language models and tools like Enformer or Borzoi?
Enformer and Borzoi are sequence-to-function models: they take DNA sequence as input and predict quantitative functional readouts like gene expression levels and chromatin accessibility across many cell types. They use large labeled training datasets of functional genomics experiments. DNA foundation models, by contrast, are pretrained on unlabeled sequence and produce general embeddings. In practice, the two approaches are complementary: a genomic foundation model can serve as a feature extractor whose representations are fine-tuned toward the same functional prediction tasks that Enformer tackles. Whether foundation-model-based approaches will surpass purpose-built sequence-to-function models on those tasks is an active research question.

Are DNA language models species-specific?
This varies by model. DNABERT was trained primarily on human reference sequence. Nucleotide Transformer was trained on the human reference genome, human genomes from the 1000 Genomes Project, and a multispecies genome collection. Evo was trained on prokaryotic and phage genomes. Evo 2 is the most broadly trained, covering more than 100,000 species across all domains of life. Multi-species pretraining is generally beneficial because evolutionary conservation provides a natural labeling signal (positions conserved across species are more likely to be functionally important), but it also means the model must allocate capacity across enormously diverse genomic architectures. For tasks specific to human regulatory genomics, models with heavy human-sequence representation may outperform models with more even multi-species coverage.

What does it cost to run inference on a model like Evo 2?
Evo 2 at 40 billion parameters is large enough that full-precision inference requires multiple high-memory GPUs or significant cloud compute. The Arc Institute provides a hosted API and has released quantized variants that reduce memory requirements. For most academic research applications involving single-gene or single-locus queries, running inference on a cloud instance for a few seconds per query is tractable. Running genome-scale inference across millions of variants — as would be required for a population-scale GWAS fine-mapping application — requires substantial compute infrastructure and cost planning that is not substantially different from running any other large foundation model at scale.
