Protein Structure Prediction Explained
From the protein folding problem to AlphaFold, Boltz, and co-folding models
Proteins are molecular machines. Their function is determined by their 3D shape, which is determined by their amino acid sequence. Predicting this shape from sequence alone — the "protein folding problem" — was one of biology's grand challenges for over 50 years.
Why Structure Matters for Drug Discovery
Most drugs work by binding to proteins and altering their function. To design a drug that binds to a specific protein, you need to know the protein's 3D structure — particularly the shape of the binding site where the drug will attach. Without structural information, drug design is largely trial and error.
Experimental methods for determining protein structure, namely X-ray crystallography, cryo-electron microscopy (cryo-EM), and NMR spectroscopy, are accurate but slow. A single structure can take months to years to solve, and some proteins resist experimental characterization entirely (notably many membrane proteins and intrinsically disordered proteins).
AlphaFold
In 2020, DeepMind's AlphaFold2 demonstrated that deep learning could predict protein structures from amino acid sequences with accuracy that, for many targets, is comparable to experimental methods. It won CASP14, the field's blind-prediction benchmark, by a wide margin.
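AlphaFold2 uses a neural network module called Evoformer, which processes multiple sequence alignments (MSAs), comparisons of the target sequence against evolutionary relatives, to extract structural information encoded by evolution. The Evoformer refines an MSA representation and a residue-pair representation together. A structure module then uses these representations to predict the 3D coordinates of the atoms directly, rather than first predicting inter-residue distances and angles and assembling a structure from them, as the original AlphaFold did. The output is fed back through the network several times ("recycling") to refine the prediction.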
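The idea that an alignment carries structural information can be illustrated with a much older and simpler technique than Evoformer: scoring how strongly pairs of alignment columns vary together. The sketch below computes mutual information between columns of a toy MSA. The alignment is invented for illustration, the function names are hypothetical, and this is classical covariation analysis, not what AlphaFold2 computes internally.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Toy alignment (hypothetical data): each row is a related sequence,
# each column is one aligned position.
msa = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "GKIDE",
    "GRIEE",
]


def column(alignment, i):
    """Return the residues found at position i across all sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def ranked_pairs(alignment):
    """Score every pair of columns and sort from most to least coupled."""
    length = len(alignment[0])
    scores = {
        (i, j): mutual_information(column(alignment, i), column(alignment, j))
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


top_pair, top_score = ranked_pairs(msa)[0]
print(f"Most strongly coupled columns: {top_pair} ({top_score:.2f} bits)")
```

In this toy alignment, columns 1 and 3 (zero-based) change together in every sequence and receive the highest score, while the fully conserved column 4 scores zero against every other column. In real proteins, strongly coupled positions are often close together in the folded structure, which is the kind of signal that MSA-based models learn to exploit in a far more sophisticated way.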
In 2024, Google DeepMind and Isomorphic Labs released AlphaFold3, which extended the approach to predict the structure of protein complexes with other molecules: other proteins, DNA, RNA, and small-molecule ligands. This is directly relevant to drug discovery, where understanding how a drug binds to its target is the central design question.
Boltz and Co-Folding
Boltz, developed by MIT researchers, is an open-source alternative to AlphaFold3 focused on "co-folding": predicting the structure of a protein and its bound small molecule together, rather than docking the molecule into a fixed protein structure afterwards. Boltz-1 was reported to approach AlphaFold3's accuracy on protein-ligand complex prediction. Boltz-2 added the ability to predict binding affinity (how tightly the drug binds), not just binding pose (where and how it binds), an important distinction for drug design.
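To make "how tightly the drug binds" concrete: affinity is commonly reported as a dissociation constant (Kd), which relates to the standard binding free energy through the textbook relation ΔG° = RT ln(Kd). The sketch below applies that relation. The function name and example values are illustrative assumptions, and nothing here reflects how Boltz-2 represents or reports its affinity predictions.

```python
from math import log

# Gas constant in kcal/(mol*K).
R_KCAL_PER_MOL_K = 1.987204e-3


def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from a dissociation
    constant in mol/L, assuming a 1 M standard state.

    More negative values mean tighter binding.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * log(kd_molar)


# Illustrative values: a micromolar binder versus a nanomolar binder.
for label, kd in [("1 uM", 1e-6), ("1 nM", 1e-9)]:
    print(f"Kd = {label}: dG = {delta_g_from_kd(kd):.1f} kcal/mol")
```

At room temperature, each tenfold improvement in Kd corresponds to roughly 1.4 kcal/mol of additional binding free energy, so two molecules can occupy the same pocket in nearly the same pose and still differ substantially in potency.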
Other Models
RoseTTAFold, from the Baker lab at University of Washington, is an open-source model that uses a "three-track" architecture processing 1D sequence, 2D distance maps, and 3D coordinates simultaneously. RoseTTAFold All-Atom (RFAA) extended this to model proteins alongside small molecules, nucleic acids, and metal ions.
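The 2D and 3D views mentioned above are closely related: a set of 3D coordinates determines a 2D matrix of pairwise distances. A minimal sketch of that relationship follows, using invented C-alpha coordinates and hypothetical function names. The 8 Å contact cutoff is a common convention, used here as an assumption, and the code is not taken from RoseTTAFold.

```python
from math import dist

# Hypothetical C-alpha coordinates in angstroms for a 5-residue fragment.
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]


def distance_map(coords):
    """Pairwise Euclidean distances between residues (the 2D view)."""
    return [[dist(a, b) for b in coords] for a in coords]


def contact_map(dmap, cutoff=8.0, min_separation=2):
    """Mark residue pairs closer than `cutoff` angstroms as contacts.

    Pairs fewer than `min_separation` positions apart in the sequence
    are skipped, because neighbors along the chain are always close.
    """
    n = len(dmap)
    return [
        [abs(i - j) >= min_separation and dmap[i][j] <= cutoff for j in range(n)]
        for i in range(n)
    ]


dmap = distance_map(ca_coords)
contacts = contact_map(dmap)

for row in dmap:
    print("  ".join(f"{d:5.1f}" for d in row))
```

The distance map does not change when the whole structure is rotated or translated, which is one reason 2D pair features are a convenient intermediate between a 1D sequence and 3D coordinates.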
ESMFold, from Meta AI, takes a different approach: instead of using multiple sequence alignments, it uses a protein language model (ESM-2) trained on millions of sequences. This makes it much faster than AlphaFold2 (seconds vs. minutes per structure) at the cost of some accuracy, particularly for proteins with few evolutionary relatives.
OpenFold is an open-source reimplementation of AlphaFold2 built for training on custom data, enabling researchers to fine-tune structure prediction for specific protein families.
Impact on Drug Discovery
Before AlphaFold, experimentally determined structures covered roughly 35% of the human proteome. Now, predicted structures are available for nearly every human protein, although prediction confidence varies between proteins and between regions of the same protein. This has enabled structure-based drug design for targets that previously had no structural information. The key remaining challenges are accuracy for protein-ligand complexes, prediction of protein dynamics (proteins are not static), and modeling of disordered regions that lack a fixed structure.