Foundation Models in Biology
How protein language models and biological LLMs are creating a new paradigm for drug discovery
Foundation models — large neural networks pre-trained on massive datasets that can be adapted to a wide range of downstream tasks — have transformed natural language processing and computer vision. The same paradigm is now reshaping biology and drug discovery. Protein language models, molecular foundation models, and multi-modal biological models are enabling researchers to extract more value from limited experimental data and to tackle prediction problems that were previously intractable.
What Makes a Foundation Model
A foundation model has three defining characteristics: it is trained on a large, broad dataset (often through self-supervised learning, without task-specific labels); it learns general-purpose representations that capture deep structure in the data; and it can be fine-tuned or adapted for many different downstream tasks with relatively little task-specific data. In natural language processing, models like GPT and BERT demonstrated this paradigm. In biology, the equivalent "language" is the sequence of amino acids in proteins or the sequence of nucleotides in DNA and RNA.
Protein Language Models
ESM-2 (Evolutionary Scale Modeling), developed by Meta AI's Fundamental AI Research (FAIR) team, is among the most widely used protein language models. ESM-2 was trained on tens of millions of protein sequences from the UniRef database using a masked language modeling objective — the model learns to predict randomly masked amino acids from their sequence context, much as BERT learns to predict masked words. ESM-2 is released as a family of models ranging from 8 million to 15 billion parameters.
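As a rough illustration of how such a model is used in practice, the sketch below extracts per-residue and sequence-level embeddings from a mid-sized ESM-2 checkpoint. It assumes the Hugging Face transformers library and the facebook/esm2_t33_650M_UR50D checkpoint; the example sequence is arbitrary.

```python
# Sketch: extracting embeddings from a pre-trained ESM-2 model via Hugging Face
# transformers. The checkpoint name and example sequence are illustrative choices.
import torch
from transformers import AutoTokenizer, EsmModel

model_name = "facebook/esm2_t33_650M_UR50D"  # 650M-parameter ESM-2 variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings: (batch, tokens, hidden); tokens include BOS/EOS.
residue_embeddings = outputs.last_hidden_state
# Mean-pool over residues (dropping BOS/EOS) for one fixed-length embedding.
sequence_embedding = residue_embeddings[0, 1:-1].mean(dim=0)
print(sequence_embedding.shape)  # torch.Size([1280]) for the 650M model
```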
The key insight is that evolutionary relationships between proteins are encoded in sequence databases. By training on millions of sequences, protein language models learn to represent evolutionary constraints — which positions in a protein tolerate mutations, which positions co-evolve, and how sequence relates to structure and function. These learned representations are useful for a wide range of tasks: predicting protein structure (ESMFold), predicting the effect of mutations, predicting protein function, and generating novel protein sequences.
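One concrete way these representations support mutation effect prediction is zero-shot scoring with the masked language modeling head: mask the mutated position and compare the model's log-probabilities for the mutant and wild-type residues. The sketch below assumes the same Hugging Face checkpoint as above and uses a simple masked-marginal heuristic; it is illustrative rather than the exact protocol of any published benchmark.

```python
# Sketch: zero-shot scoring of a point mutation with ESM-2's masked LM head.
# Mask the mutated position, then compare log P(mutant) vs. log P(wild type).
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t33_650M_UR50D"  # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)
model.eval()

def mutation_score(sequence: str, position: int, wt: str, mut: str) -> float:
    """log P(mut) - log P(wt) at `position` (0-based); higher = more tolerated."""
    assert sequence[position] == wt, "wild-type residue does not match sequence"
    inputs = tokenizer(sequence, return_tensors="pt")
    token_idx = position + 1  # offset by one for the BOS token
    inputs["input_ids"][0, token_idx] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, token_idx], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

print(mutation_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", position=4, wt="Y", mut="W"))
```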
ProtTrans, a collection of protein language models developed at the Technical University of Munich, explored various architectures (BERT, T5, XLNet, and others) applied to protein sequences. ProtT5-XL-UniRef50 has become a widely used model for generating protein embeddings. ProGen, developed at Salesforce Research, is an autoregressive protein language model trained to generate protein sequences conditioned on functional tags, enabling controllable protein generation.
Pre-Training on Evolutionary Data
Protein language models are trained on sequence databases that contain the accumulated results of billions of years of evolution. The UniRef and UniParc databases contain hundreds of millions of protein sequences from across the tree of life. Evolution acts as a massive experiment: mutations that break a protein's function are selected against, while those that maintain or improve function are preserved. The statistical patterns of conservation and co-variation in these sequences encode deep information about protein structure, stability, and function.
This is why pre-training works: a model that accurately predicts masked amino acids must implicitly capture the evolutionary and biophysical constraints that govern protein sequences. The representations that emerge encode secondary structure, contact maps, binding sites, and functional annotations — all without being explicitly trained on any of these labels.
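To make the objective concrete, here is a toy masked language modeling step on protein sequences. The tiny transformer, vocabulary, 15% mask rate, and random batch are all illustrative placeholders, not the architecture or data of any real protein language model.

```python
# Sketch of the masked language modeling objective on protein sequences,
# using a small toy transformer encoder (positional encodings omitted for brevity).
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue tokens: ids 0..19
MASK_ID = 20                           # special mask token
VOCAB_SIZE = 21

class TinyProteinLM(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        return self.lm_head(self.encoder(self.embed(tokens)))

def mlm_loss(model, tokens, mask_rate=0.15):
    """Mask a random subset of residues and score only those positions."""
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_rate
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    logits = model(corrupted)
    labels[~mask] = -100  # ignore unmasked positions in the loss
    return nn.functional.cross_entropy(
        logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100
    )

# Toy usage: one pre-training step on a random batch of two length-50 "sequences".
model = TinyProteinLM()
batch = torch.randint(0, 20, (2, 50))
loss = mlm_loss(model, batch)
loss.backward()
print(float(loss))
```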
Transfer Learning for Drug Discovery Tasks
The practical value of protein foundation models lies in transfer learning. Once a model has learned general protein representations from millions of sequences, it can be fine-tuned on small, task-specific datasets with much better performance than training from scratch. Applications include the following (a minimal fine-tuning sketch appears after the list):
- Protein function prediction — classifying proteins by Gene Ontology terms or enzyme commission numbers
- Mutation effect prediction — estimating whether a point mutation will be deleterious or beneficial, relevant for understanding disease variants and engineering proteins
- Binding site prediction — identifying which residues in a protein are likely to bind ligands or interact with other proteins
- Protein engineering — guiding directed evolution by predicting which mutations improve stability, activity, or specificity
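As a concrete example of the transfer-learning recipe, the sketch below freezes a small ESM-2 backbone and trains a lightweight classification head on a handful of labeled sequences. The checkpoint name (facebook/esm2_t12_35M_UR50D), the placeholder sequences and labels, and the head architecture are assumptions chosen for illustration; a real project would use a proper dataset, batching, and validation.

```python
# Sketch: transfer learning with a frozen ESM-2 backbone and a small task head,
# here a binary sequence-level classifier. Dataset and labels are placeholders.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

backbone_name = "facebook/esm2_t12_35M_UR50D"  # small ESM-2 variant for speed
tokenizer = AutoTokenizer.from_pretrained(backbone_name)
backbone = EsmModel.from_pretrained(backbone_name)
for p in backbone.parameters():  # freeze the pre-trained representation
    p.requires_grad = False

head = nn.Sequential(
    nn.Linear(backbone.config.hidden_size, 128), nn.ReLU(), nn.Linear(128, 1)
)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Tiny placeholder dataset: (sequence, label) pairs.
train_data = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 1.0),
    ("GSHMSLFDFFKNKGSAAATP", 0.0),
]

for sequence, label in train_data:
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():  # backbone is frozen, no gradients needed
        hidden = backbone(**inputs).last_hidden_state
    pooled = hidden[0, 1:-1].mean(dim=0)  # mean-pool residue embeddings
    logit = head(pooled)
    loss = loss_fn(logit, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```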
Molecular Foundation Models
The foundation model paradigm extends beyond proteins to small molecules. Models pre-trained on large collections of molecular structures (from databases like ZINC, ChEMBL, and PubChem) learn general-purpose molecular representations. MolBERT and related models apply BERT-style masked language modeling to SMILES strings. Uni-Mol, from DP Technology, pre-trains a transformer on 3D molecular conformations. These models can then be fine-tuned for property prediction, virtual screening, and molecular generation tasks, often outperforming models trained from scratch on specific tasks.
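To show what "masked language modeling on SMILES strings" means in practice, here is a minimal sketch that tokenizes a SMILES string with a simple regex and masks a random subset of tokens. The tokenizer pattern and 15% mask rate are common conventions, not the exact implementation of MolBERT or any specific model.

```python
# Sketch: BERT-style masking applied to a SMILES string, treating a molecule
# as a token sequence. The regex tokenizer and mask rate are illustrative.
import random
import re

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOPSFIbcnops]|%[0-9]{2}|[0-9]|=|#|\(|\)|\+|-|/|\\)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_TOKEN_PATTERN.findall(smiles)

def mask_tokens(tokens: list[str], mask_rate: float = 0.15):
    """Replace a random subset of tokens with [MASK]; return targets for the loss."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok
            masked[i] = "[MASK]"
    return masked, targets

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
tokens = tokenize_smiles(smiles)
masked, targets = mask_tokens(tokens)
print(tokens)
print(masked)
print(targets)  # positions a model would be trained to reconstruct
```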
Multi-Modal and Cross-Domain Models
An emerging frontier is multi-modal foundation models that integrate information across different biological data types: protein sequences, molecular structures, gene expression data, clinical records, and scientific text. The rationale is that drug discovery inherently spans multiple data modalities — a single drug program involves protein structures, chemical compounds, assay data, clinical outcomes, and published literature. Models that can reason across these modalities could enable novel insights.
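As a schematic of cross-modal fusion, the toy model below projects a pooled protein embedding and a molecule embedding into a shared space and predicts a single binding-related score. The dimensions, architecture, and random inputs are purely illustrative and do not correspond to any particular system.

```python
# Sketch: a toy two-tower model that fuses a protein embedding and a molecule
# embedding to predict a binding-related score. All sizes are illustrative.
import torch
import torch.nn as nn

class TwoTowerAffinityModel(nn.Module):
    def __init__(self, protein_dim=1280, molecule_dim=512, hidden=256):
        super().__init__()
        self.protein_proj = nn.Linear(protein_dim, hidden)
        self.molecule_proj = nn.Linear(molecule_dim, hidden)
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, protein_emb, molecule_emb):
        p = self.protein_proj(protein_emb)
        m = self.molecule_proj(molecule_emb)
        return self.fusion(torch.cat([p, m], dim=-1))

# Toy usage with random stand-ins for embeddings that would come from a
# protein language model and a molecular foundation model.
model = TwoTowerAffinityModel()
protein_emb = torch.randn(4, 1280)   # batch of 4 pooled protein embeddings
molecule_emb = torch.randn(4, 512)   # batch of 4 molecule embeddings
scores = model(protein_emb, molecule_emb)
print(scores.shape)  # torch.Size([4, 1])
```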
Xaira Therapeutics, launched in 2024 with over $1 billion in funding, has positioned foundation models as central to its drug discovery strategy. Founded by former leadership from Illumina and the Allen Institute, with involvement from key researchers in protein language models, Xaira's approach centers on building large-scale biological foundation models trained on proprietary and public data to drive target discovery, molecular design, and clinical prediction.
Scaling Laws and Limitations
In NLP, scaling laws describe predictable relationships between model size, dataset size, and performance. Preliminary evidence suggests that similar scaling behavior exists for protein language models — larger models trained on more data tend to produce better representations. However, biological scaling laws are less well-characterized than in NLP, and it remains unclear how far scaling alone will take us.
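For intuition, a scaling law of the form loss ~ a * N^(-alpha) is typically fit in log-log space. The snippet below fits such a curve to synthetically generated points; the parameter counts mirror the ESM-2 model family, but the losses are generated inside the script purely to demonstrate the fit, not measured results.

```python
# Sketch: fitting a power-law scaling curve, loss = a * N**(-alpha), in
# log-log space. The "losses" are synthetic placeholders, not real data.
import numpy as np

rng = np.random.default_rng(0)
sizes = np.array([8e6, 35e6, 150e6, 650e6, 3e9, 15e9])  # ESM-2-like sizes
losses = 3.0 * sizes ** -0.05 * np.exp(rng.normal(0.0, 0.01, sizes.shape))

# A power law is linear in log-log space: log(loss) = log(a) - alpha * log(N).
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted: loss ~= {a:.2f} * N^(-{alpha:.3f})")
```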
Key limitations of biological foundation models include:
- The sequence-function relationship is more complex than the word-meaning relationship in language
- Experimental data for fine-tuning is orders of magnitude scarcer than text data
- Many important biological phenomena (protein dynamics, cellular context, tissue-level interactions) are poorly captured by sequence data alone
- Evaluation is difficult because ground-truth biological labels are often noisy or incomplete

Foundation models are a powerful tool for drug discovery, but they complement rather than replace experimental biology.