Generative Models in Drug Design
How diffusion models, VAEs, and language models are designing novel molecules
Generative models in drug design do something conceptually different from traditional screening: instead of searching through existing chemical libraries for molecules that might work, they generate novel molecules optimized for desired properties. This dramatically expands the search space beyond known chemistry.
The Chemical Space Problem
The number of possible drug-like small molecules is estimated at 10⁶⁰ — far more than could ever be synthesized or tested. Traditional drug discovery samples a tiny fraction of this space (existing compound libraries contain millions to low billions of molecules). Generative models navigate this space computationally, proposing molecules that are likely to have desired properties without needing to enumerate every possibility.
Molecular Representations
To generate molecules computationally, a model needs a representation of molecular structure it can process. Common representations include:
- SMILES — text strings that encode molecular structure (e.g., "CC(=O)OC1=CC=CC=C1C(=O)O" for aspirin). Language models can generate molecules by generating SMILES strings.
- Molecular graphs — atoms as nodes, bonds as edges. Graph neural networks (GNNs) operate on this representation directly.
- 3D coordinates — atom positions in space. Required for structure-based design and used by diffusion models.
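The SMILES representation makes the connection to language models concrete: a molecule becomes a sequence of tokens. The sketch below (a simplified tokenizer, not a full SMILES parser — bracket atoms, two-letter elements, and ring-closure digits are matched with a hand-written regex) shows how a SMILES string would be split into the tokens a sequence model consumes.

```python
import re

# Simplified SMILES tokenizer: splits a string into the atom/bond/ring
# tokens a language model would consume. Bracket atoms and two-letter
# elements must be matched before single characters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atoms, e.g. [N+]
    r"|Cl|Br"            # two-letter organic-subset elements
    r"|[BCNOPSFI]"       # one-letter organic-subset elements
    r"|[bcnops]"         # aromatic atoms (lowercase in SMILES)
    r"|[=#\-+\\/()%\d]"  # bonds, branches, ring-closure digits
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(tokenize_smiles(aspirin))
```

A quick sanity check on any tokenizer like this: rejoining the tokens should reproduce the original string, so no characters were silently dropped.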
Key Model Architectures
Variational Autoencoders (VAEs) learn a compressed "latent space" in which molecules are represented as continuous vectors. Known molecules are encoded into this space, and new molecules are generated by decoding samples from it. Because the space is continuous, it supports smooth interpolation between known molecules and gradient-based property optimization directly in latent space.
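The latent-space operation this enables can be illustrated with a toy sketch. The latent vectors below are made up, and a real VAE's learned encoder/decoder networks are omitted entirely — the point is only the interpolation step, which is plain vector arithmetic.

```python
# Toy latent-space interpolation, the core operation a molecular VAE
# enables. The vectors are illustrative stand-ins; a trained VAE would
# produce them with its encoder and turn them back into molecules with
# its decoder.
def lerp(z_a, z_b, t):
    """Linearly interpolate between two latent vectors at fraction t."""
    return [a + t * (b - a) for a, b in zip(z_a, z_b)]

z_known = [0.2, -1.1, 0.7]   # latent code of a known active molecule
z_novel = [1.4, 0.3, -0.5]   # latent code of a structurally distinct one

# Decoding points along this path yields molecules that morph smoothly
# from one structure toward the other.
path = [lerp(z_known, z_novel, t / 4) for t in range(5)]
print(path[0], path[-1])
```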
Diffusion models have become prominent in molecular generation. These models learn to reverse a noise-adding process: they start from random noise and gradually refine it into a valid molecular structure. RFDiffusion, from the Baker lab, uses this approach for protein design — generating novel protein backbones that fold into desired shapes.
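The shape of that reverse process can be sketched in a few lines. The "denoiser" below is a stand-in that simply nudges random 2D points toward target coordinates; a real model like RFDiffusion predicts the denoising direction at each step with a trained neural network.

```python
import random

# Toy sketch of reverse diffusion: begin with pure noise and take many
# small denoising steps toward a structure. The denoising rule here is
# a stand-in (move 10% of the way toward the target each step); a real
# diffusion model learns this update from data.
random.seed(0)

target = [(0.0, 0.0), (1.5, 0.0), (2.3, 1.1)]  # idealized atom positions
coords = [(random.gauss(0, 3), random.gauss(0, 3)) for _ in target]  # noise

steps = 80
for _ in range(steps):
    coords = [(x + 0.1 * (tx - x), y + 0.1 * (ty - y))
              for (x, y), (tx, ty) in zip(coords, target)]

# After many steps, the noise has been refined into the target geometry.
error = max(abs(x - tx) + abs(y - ty)
            for (x, y), (tx, ty) in zip(coords, target))
print(f"max residual after {steps} steps: {error:.4f}")
```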
Language models treat molecular design as a sequence generation problem. Models trained on SMILES strings can generate novel molecules autoregressively (character by character). Larger models like those behind Chemistry42 combine language model generation with reinforcement learning to optimize for specific drug properties.
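A minimal version of that autoregressive loop fits in a few lines. The sketch below fits bigram (character-pair) counts on a tiny made-up SMILES corpus and samples new strings one character at a time; real systems replace the bigram table with a large neural network, but the sampling loop has the same shape.

```python
import random
from collections import defaultdict

# Character-level autoregressive sampling from bigram counts — a toy
# stand-in for a neural SMILES language model. The corpus is illustrative.
corpus = ["CCO", "CC(=O)O", "CCN", "CC(C)O", "CCC"]

counts = defaultdict(lambda: defaultdict(int))
for s in corpus:
    padded = "^" + s + "$"          # ^ = start token, $ = end token
    for a, b in zip(padded, padded[1:]):
        counts[a][b] += 1

def sample(rng, max_len=20):
    """Generate one string character by character until $ or max_len."""
    ch, out = "^", []
    while len(out) < max_len:
        nxt = rng.choices(list(counts[ch]),
                          weights=list(counts[ch].values()))[0]
        if nxt == "$":
            break
        out.append(nxt)
        ch = nxt
    return "".join(out)

rng = random.Random(0)
samples = [sample(rng) for _ in range(3)]
print(samples)
```

Note that nothing here guarantees chemical validity — a weakness of string-based generation that graph- and 3D-based models address by construction.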
Multi-Property Optimization
A drug candidate must satisfy many constraints simultaneously: high binding affinity to the target, selectivity against off-targets, metabolic stability, solubility, synthetic accessibility, and low toxicity. This multi-objective optimization is where generative models offer the most value over human intuition. Modern platforms use reinforcement learning or Pareto optimization to navigate these trade-offs automatically.
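Pareto optimization, one of the approaches mentioned above, can be shown concretely: a candidate is kept only if no other candidate dominates it, i.e. matches or beats it on every objective and strictly beats it on at least one. The candidates and scores below are invented for illustration (higher is better for all three objectives).

```python
# Pareto filtering over candidate molecules with three objectives
# (illustrative scores: binding affinity, solubility, synthetic
# accessibility — all scaled so higher is better).
candidates = {
    "mol_A": (0.9, 0.2, 0.7),
    "mol_B": (0.7, 0.8, 0.6),
    "mol_C": (0.6, 0.7, 0.5),   # worse than mol_B on every objective
    "mol_D": (0.3, 0.9, 0.9),
}

def dominates(p, q):
    """p dominates q: at least as good everywhere, strictly better somewhere."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

pareto = [name for name, p in candidates.items()
          if not any(dominates(q, p)
                     for other, q in candidates.items() if other != name)]
print(pareto)  # mol_C drops out: mol_B dominates it
```

Each surviving molecule represents a different trade-off; no single one is best on all axes, which is exactly why the front, not a single score, is the useful output.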
Validation Gap
The central challenge in generative drug design is the gap between computational prediction and experimental validation. A model can generate a molecule predicted to have excellent properties, but the prediction may be wrong. Closing this loop — generating, synthesizing, testing, and feeding results back into the model — is where companies like Recursion (with automated labs) and Absci (with high-throughput protein production) differentiate themselves.
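The design-make-test-learn loop described above can be sketched schematically. Everything here is a stand-in: `generate_batch` plays the role of a generative model, `assay` plays the role of a noisy wet-lab measurement, and a "molecule" is reduced to a single number whose value is its true quality. The structure to notice is the feedback step, where measured results redirect what gets generated next.

```python
import random

# Schematic design-make-test-learn loop. All functions are illustrative
# stand-ins, not a real discovery pipeline.
rng = random.Random(42)

def generate_batch(center, n=8):
    """Stand-in generator: propose candidates near the current best region."""
    return [min(1.0, max(0.0, center + rng.gauss(0, 0.3))) for _ in range(n)]

def assay(candidate):
    """Stand-in experiment: noisy measurement of a candidate's quality."""
    return candidate + rng.gauss(0, 0.05)

center = 0.5                                  # where the model searches now
for cycle in range(4):
    batch = generate_batch(center)            # design
    results = [(assay(c), c) for c in batch]  # make + test
    _, center = max(results)                  # learn: recentre on the winner

print(round(center, 3))
```

In practice the feedback step means retraining or fine-tuning the generative model on assay results, not just re-centering a search — but the loop topology is the same.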