Generative Chemistry · Models

Generative Models in Drug Design

How diffusion models, VAEs, and language models are designing novel molecules


Generative models in drug design do something conceptually different from traditional screening: instead of searching through existing chemical libraries for molecules that might work, they generate novel molecules optimized for desired properties. This dramatically expands the search space beyond known chemistry.

The Chemical Space Problem

The number of possible drug-like small molecules is estimated at 10^60, far more than could ever be synthesized or tested. Traditional drug discovery samples a tiny fraction of this space (existing compound libraries contain millions to low billions of molecules). Generative models navigate this space computationally, proposing molecules that are likely to have desired properties without needing to enumerate every possibility.
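
To make the scale concrete, the short calculation below compares the estimated size of drug-like chemical space with a large screening library. The library size of 10^9 is an assumed stand-in for "low billions", and both figures are order-of-magnitude estimates only.

```python
# Rough scale comparison; both numbers are order-of-magnitude estimates.
CHEMICAL_SPACE = 10**60   # estimated number of drug-like small molecules
LIBRARY_SIZE = 10**9      # assumed size of a large existing compound library

# How many times larger the full space is than the library.
gap = CHEMICAL_SPACE // LIBRARY_SIZE
orders_of_magnitude = len(str(gap)) - 1

print(f"The space is about 10^{orders_of_magnitude} times larger than the library")
```

Under these assumptions the gap is 51 orders of magnitude, which is why exhaustive enumeration is not an option.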

Molecular Representations

To generate molecules computationally, they must be represented in a format that models can process. Common representations include:

  • SMILES — text strings that encode molecular structure (e.g., "CC(=O)OC1=CC=CC=C1C(=O)O" for aspirin). Language models can generate molecules by generating SMILES strings.
  • Molecular graphs — atoms as nodes, bonds as edges. Graph neural networks (GNNs) operate on this representation directly.
  • 3D coordinates — atom positions in space. Required for structure-based design and used by diffusion models.
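
The link between the first two representations can be sketched in code. The example below is a minimal, standard-library-only converter from a SMILES string to an atom list and bond list. It handles only a small subset of SMILES (organic atoms, explicit bond symbols, branches, single-digit ring closures) and is an illustration, not a replacement for a cheminformatics toolkit. The function names `tokenize` and `smiles_to_graph` are hypothetical.

```python
import re

# Simplified SMILES subset: organic atoms, explicit - = # bonds,
# parentheses for branches, single-digit ring closures.
TOKEN_RE = re.compile(r"Br|Cl|[BCNOPSFI]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def tokenize(smiles):
    """Split a SMILES string into tokens; reject anything outside the subset."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES syntax in {smiles!r}")
    return tokens


def smiles_to_graph(smiles):
    """Return (atoms, bonds): atoms is a list of element symbols,
    bonds is a list of (atom_index, atom_index, bond_order)."""
    atoms, bonds = [], []
    prev = None          # index of the atom the next atom attaches to
    branch_stack = []    # attachment points saved at each "("
    open_rings = {}      # ring digit -> (atom index, bond order) awaiting closure
    pending = 1          # order of the next bond (single unless a symbol says otherwise)

    for tok in tokenize(smiles):
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok.isdigit():
            if tok in open_rings:
                start, order = open_rings.pop(tok)
                bonds.append((start, prev, max(order, pending)))
            else:
                open_rings[tok] = (prev, pending)
            pending = 1
        else:
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending))
            prev = len(atoms) - 1
            pending = 1
    return atoms, bonds


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
atoms, bonds = smiles_to_graph(aspirin)
print(len(atoms), "heavy atoms,", len(bonds), "bonds")
```

The token list is roughly what a SMILES language model consumes, while the atom and bond lists are the nodes and edges a GNN would operate on. Hydrogens are implicit in both and are not counted here.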

Key Model Architectures

Variational Autoencoders (VAEs) learn a compressed, continuous "latent space" of molecular structures. An encoder maps each molecule to a point in this space, and a decoder generates new molecules from points sampled there. This allows smooth interpolation between known molecules and property optimization directly in latent space.
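
The two latent-space operations mentioned here, sampling and interpolation, can be sketched without a trained model. In the example below the latent vectors are made-up two-dimensional placeholders, and the encoder and decoder are assumed to exist elsewhere. The function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, the standard VAE reparameterization."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def interpolate(z_start, z_end, num_points):
    """Evenly spaced points on the straight line between two latent vectors."""
    path = []
    for i in range(num_points):
        t = i / (num_points - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical 2-D latent codes for two known molecules. A real model would
# obtain these from its encoder, and latent spaces are usually much larger.
z_mol_a = [0.0, 2.0]
z_mol_b = [4.0, -2.0]

path = interpolate(z_mol_a, z_mol_b, num_points=5)
for z in path:
    print(z)  # each point would be passed to the trained decoder

# Sampling near molecule A: a small variance keeps samples close to its code.
rng = random.Random(0)
z_near_a = reparameterize(z_mol_a, log_var=[-4.0, -4.0], rng=rng)
```

Decoding each point along `path` is what would yield a series of molecules that gradually change from one structure to the other, provided the learned space is smooth enough.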

Diffusion models have become prominent in molecular generation. These models learn to reverse a noise-adding process: they start from random noise and gradually refine it into a valid molecular structure. RFdiffusion, from the Baker lab, uses this approach for protein design, generating novel protein backbones that fold into desired shapes.
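
The noise-adding half of this process is simple enough to sketch directly. The example below assumes a DDPM-style formulation with a linear noise schedule, which is one common choice and not necessarily what any particular molecular model uses. The coordinates are placeholder values rather than a real conformer, and the learned denoising network that reverses the process is out of scope here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: the fraction of signal left."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(coords, alpha_bar_t, rng):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in atom)
            for atom in coords]


# Placeholder 3D positions for three atoms (illustrative only).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]

alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
slightly_noisy = add_noise(coords, alpha_bars[10], rng)   # structure mostly intact
almost_noise = add_noise(coords, alpha_bars[-1], rng)     # structure essentially gone
```

Training teaches a network to predict the noise that was added at each step. Generation then runs the chain backwards, starting from pure Gaussian noise.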

Language models treat molecular design as a sequence generation problem. Models trained on SMILES strings can generate novel molecules autoregressively (character by character). Larger models like those behind Chemistry42 combine language model generation with reinforcement learning to optimize for specific drug properties.
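
Autoregressive generation can be illustrated with a deliberately tiny stand-in for a neural language model. The sketch below fits a character-level bigram model to a handful of SMILES strings and samples from it one character at a time. Real systems use far larger models and datasets, and the training list, function names, and start/end markers here are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # sentinel symbols marking sequence boundaries

# A few small, valid SMILES strings used as toy training data.
TRAINING_SMILES = ["CCO", "CC(=O)O", "CCN", "CC(C)O", "CCOC", "CC(=O)N"]


def train_bigram(smiles_list):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for smiles in smiles_list:
        seq = [START] + list(smiles) + [END]
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def sample(counts, rng, max_len=40):
    """Generate one string by repeatedly sampling the next character."""
    out, current = [], START
    for _ in range(max_len):
        options = list(counts[current].keys())
        weights = list(counts[current].values())
        nxt = rng.choices(options, weights=weights)[0]
        if nxt == END:
            break
        out.append(nxt)
        current = nxt
    return "".join(out)


model = train_bigram(TRAINING_SMILES)
rng = random.Random(0)
for _ in range(5):
    print(sample(model, rng))
```

Nothing in this sampler guarantees chemically valid output: a bigram model can easily leave a parenthesis unclosed. In practice generated strings are checked with a cheminformatics toolkit, and invalid ones are discarded or penalized.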

Multi-Property Optimization

A drug candidate must satisfy many constraints simultaneously: high binding affinity to the target, selectivity against off-targets, metabolic stability, solubility, synthetic accessibility, and low toxicity. This multi-objective optimization is where generative models offer the most value over human intuition. Modern platforms use reinforcement learning or Pareto optimization to navigate these trade-offs automatically.
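
The Pareto idea can be shown in a few lines: a candidate stays in consideration only if no other candidate is at least as good on every objective and strictly better on one. The molecule names and scores below are made up for illustration, with every objective scaled so that higher is better.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Names of candidates that no other candidate dominates."""
    front = []
    for name, score in scores.items():
        dominated = any(dominates(other, score)
                        for other_name, other in scores.items()
                        if other_name != name)
        if not dominated:
            front.append(name)
    return front


# Hypothetical scores: (binding affinity, solubility, synthetic accessibility).
candidates = {
    "mol_A": (0.9, 0.2, 0.5),
    "mol_B": (0.6, 0.7, 0.6),
    "mol_C": (0.5, 0.6, 0.5),  # worse than mol_B on every objective
    "mol_D": (0.3, 0.9, 0.8),
}

print(pareto_front(candidates))  # ['mol_A', 'mol_B', 'mol_D']
```

The front contains the genuine trade-offs. Choosing among them still requires a judgment about which properties matter most for the program at hand.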

Validation Gap

The central challenge in generative drug design is the gap between computational prediction and experimental validation. A model can generate a molecule predicted to have excellent properties, but the prediction may be wrong. Closing this loop — generating, synthesizing, testing, and feeding results back into the model — is where companies like Recursion (with automated labs) and Absci (with high-throughput protein production) differentiate themselves.
