
Molecular Representations for Machine Learning

SMILES, molecular graphs, fingerprints, and 3D coordinates — how molecules become data


Before any machine learning model can process a molecule, the molecule must be converted into a mathematical representation — a set of numbers that a computer can work with. The choice of representation fundamentally shapes what the model can learn and how well it performs. Different representations capture different aspects of molecular structure, and no single representation is optimal for all tasks.

SMILES Strings

SMILES (Simplified Molecular-Input Line-Entry System) is a text-based notation for describing molecular structures. Each molecule is encoded as a string of characters: atoms are represented by their chemical symbols, bonds by special characters (= for double bonds, # for triple bonds, with single bonds implicit), branches by parentheses, and ring closures by matching digits. For example, ethanol is "CCO", benzene is "c1ccccc1", and aspirin is "CC(=O)Oc1ccccc1C(=O)O".
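As a quick illustration of how the string maps onto structure, the aspirin example can be parsed with RDKit (the same toolkit used later in this article for canonicalization and conformer generation) and its atoms and bonds listed:

```python
from rdkit import Chem

# Parse the aspirin SMILES from the examples above into a molecule object
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

print(mol.GetNumAtoms())                              # 13 heavy atoms (hydrogens are implicit)
print([atom.GetSymbol() for atom in mol.GetAtoms()])  # element of each atom

# Each bond recovered from the '=', ring-closure, and implicit single-bond syntax
for bond in mol.GetBonds():
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
```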

SMILES representations enable the use of natural language processing architectures — recurrent neural networks, transformers, and other sequence models — for molecular property prediction and generation. Models trained on large collections of SMILES can then emit new strings character by character, effectively proposing novel molecules as text.
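A minimal sketch of the first step in that pipeline, assuming a purely character-level vocabulary (real SMILES tokenizers also treat multi-character symbols such as Cl, Br, or [nH] as single tokens):

```python
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, from the examples above

# Build a character-level vocabulary and map the string to integer tokens
vocab = sorted(set(smiles))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
tokens = [char_to_idx[ch] for ch in smiles]

print(vocab)   # ['(', ')', '1', '=', 'C', 'O', 'c']
print(tokens)  # the integer sequence an RNN or transformer would consume
```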

A key limitation is that SMILES strings are not unique: the same molecule can be written as many different strings depending on atom ordering. Canonical SMILES algorithms (such as RDKit's canonicalization) assign a deterministic ordering to resolve this, but the string form still has drawbacks: atoms that are bonded in the molecule can sit far apart in the string, and small edits to a string often produce something that is not a valid molecule at all. SELFIES (Self-Referencing Embedded Strings), developed by Aspuru-Guzik and colleagues, is an alternative string representation in which every string corresponds to a valid molecule — a useful property for generative models that eliminates the problem of generating syntactically invalid SMILES.
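Both points are easy to see in code; the sketch below assumes RDKit and the separate selfies package are installed:

```python
from rdkit import Chem
import selfies as sf

# Two differently ordered SMILES for ethanol collapse to one canonical string
print(Chem.MolToSmiles(Chem.MolFromSmiles("OCC")))    # 'CCO'
print(Chem.MolToSmiles(Chem.MolFromSmiles("C(O)C")))  # 'CCO'

# Round-trip through SELFIES; any SELFIES string decodes to a valid molecule
s = sf.encoder("CCO")
print(s, "->", sf.decoder(s))                          # '[C][C][O]' -> 'CCO'
```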

Molecular Fingerprints

Molecular fingerprints encode structural features as fixed-length binary or count vectors. The most widely used are Extended-Connectivity Fingerprints (ECFPs), also known as Morgan fingerprints. These work by iteratively examining the neighborhood of each atom: at radius 0, only the atom itself; at radius 1, the atom and its immediate neighbors; at radius 2, neighbors of neighbors; and so on. Each unique substructure found is hashed to a position in a bit vector. ECFP4 (radius 2) and ECFP6 (radius 3) with 1024 or 2048 bits are common choices.
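A short sketch with RDKit (the call shown is the long-standing one; newer RDKit releases also provide a fingerprint-generator interface):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# ECFP4-style Morgan fingerprint: radius 2, folded into 2048 bits
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

print(fp.GetNumOnBits())     # how many substructures set a bit
print(list(fp.GetOnBits()))  # the bit positions, usable directly as model features
```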

Fingerprints have been used for decades in cheminformatics for similarity searching, clustering, and as inputs to classical ML models (random forests, support vector machines, gradient-boosted trees). They remain competitive baselines: random forests on Morgan fingerprints often match or beat more sophisticated GNN models on molecular property prediction benchmarks, particularly when training data is limited.
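A minimal baseline in that spirit is sketched below; the SMILES strings and property values are invented purely for illustration, and a real workflow would use a proper dataset and train/test split.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: a few molecules with made-up property values
smiles_list = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN", "CCCC"]
y = np.array([0.2, 1.5, 1.2, 0.3, 0.9])

def featurize(smi):
    """Morgan fingerprint (radius 2, 2048 bits) as a plain numpy vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return np.array(fp)

X = np.array([featurize(s) for s in smiles_list])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print(model.predict(featurize("CCCO").reshape(1, -1)))  # predict for an unseen molecule
```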

Molecular Graphs and GNNs

A molecule is naturally a graph: atoms are nodes and bonds are edges. Graph neural networks (GNNs) operate directly on this representation. In a typical message-passing neural network, each atom starts with a feature vector (encoding element type, charge, hybridization, etc.), and these features are iteratively updated by aggregating information from neighboring atoms. After several rounds of message passing, a global pooling operation combines all atom-level representations into a single molecular representation that can be used for property prediction.
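A deliberately simplified sketch of this loop in PyTorch, using a dense adjacency matrix, sum aggregation, and sum pooling; production implementations (e.g. Chemprop or PyTorch Geometric models) use sparse edge lists, bond features, and batching.

```python
import torch
import torch.nn as nn

class SimpleMPNN(nn.Module):
    """Toy message-passing network: update atom states from bonded neighbors, then pool."""

    def __init__(self, atom_dim, hidden_dim, n_steps=3):
        super().__init__()
        self.embed = nn.Linear(atom_dim, hidden_dim)
        self.message = nn.Linear(hidden_dim, hidden_dim)
        self.update = nn.GRUCell(hidden_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, 1)
        self.n_steps = n_steps

    def forward(self, atom_feats, adjacency):
        # atom_feats: (n_atoms, atom_dim), adjacency: (n_atoms, n_atoms), 1.0 where bonded
        h = torch.relu(self.embed(atom_feats))
        for _ in range(self.n_steps):
            msgs = adjacency @ self.message(h)  # sum messages from bonded neighbors
            h = self.update(msgs, h)            # update each atom's hidden state
        mol_vec = h.sum(dim=0)                  # global sum pooling over atoms
        return self.readout(mol_vec)            # one predicted property per molecule

# Toy usage: 4 atoms with 8-dimensional features in a linear chain A-B-C-D
atom_feats = torch.randn(4, 8)
adjacency = torch.tensor([[0, 1, 0, 0],
                          [1, 0, 1, 0],
                          [0, 1, 0, 1],
                          [0, 0, 1, 0]], dtype=torch.float)
print(SimpleMPNN(atom_dim=8, hidden_dim=32)(atom_feats, adjacency))
```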

GNN architectures for molecules include GCN (Graph Convolutional Networks), GAT (Graph Attention Networks, which learn to weight neighbor contributions), MPNN (Message Passing Neural Networks, as formalized by Gilmer et al.), and SchNet (which incorporates continuous interatomic distances). The Chemprop library, widely used in pharmaceutical research, implements a directed-MPNN architecture specifically designed for molecular property prediction.

Graph representations capture molecular topology — which atoms are connected to which — but standard 2D graph representations do not capture 3D spatial arrangement. This is adequate for many property prediction tasks but insufficient for tasks where 3D structure matters, such as binding pose prediction.

3D Coordinate Representations

For structure-based drug design, molecules must be represented in 3D space. This means specifying the (x, y, z) coordinates of each atom. 3D representations capture spatial relationships that 2D graphs miss: the shape of a binding pocket, the distance between pharmacophoric features, and steric clashes. Generating 3D conformations from 2D molecular graphs is itself a computational task (conformer generation), typically performed with tools like RDKit's ETKDG algorithm or the OpenEye Omega tool.
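For instance, a single conformer for aspirin can be generated with RDKit's ETKDG; the exact coordinates depend on the random seed and on whether a force-field cleanup step is applied.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, 2D graph only
mol = Chem.AddHs(mol)                               # explicit hydrogens matter for geometry

params = AllChem.ETKDGv3()
params.randomSeed = 42                              # reproducible embedding
AllChem.EmbedMolecule(mol, params)                  # generate one 3D conformer
AllChem.MMFFOptimizeMolecule(mol)                   # optional force-field relaxation

coords = mol.GetConformer().GetPositions()          # (n_atoms, 3) array of x, y, z
print(coords.shape)
```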

SE(3)-equivariant neural networks are a class of architectures designed to respect the symmetries of 3D space. A scalar molecular property such as an energy should not change if you rotate or translate the molecule in space: it is invariant to these transformations. Vector quantities such as forces or predicted coordinates, by contrast, should rotate along with the molecule; they are equivariant. SE(3)-equivariant models (such as EGNN, TFN (Tensor Field Networks), PaiNN, and MACE) build these symmetries directly into the network architecture, ensuring the model produces consistent outputs regardless of the molecule's position and orientation. These architectures have become essential for tasks like molecular dynamics simulation, binding pose prediction, and force field learning.
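The sketch below, loosely in the spirit of EGNN, shows the core idea: node features are updated only from rotation-invariant quantities (squared distances), while coordinates are updated along relative position vectors, so rotating the input rotates the output coordinates identically. It is a compressed illustration, not a faithful reimplementation of any published model.

```python
import torch
import torch.nn as nn

class EquivariantLayerSketch(nn.Module):
    """Compressed E(3)-equivariant update loosely following the EGNN recipe."""

    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU(),
                                      nn.Linear(dim, dim), nn.SiLU())
        self.coord_mlp = nn.Linear(dim, 1)
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(),
                                      nn.Linear(dim, dim))

    def forward(self, h, x):
        # h: (N, dim) invariant atom features, x: (N, 3) coordinates
        diff = x[:, None, :] - x[None, :, :]          # (N, N, 3) relative position vectors
        dist2 = (diff ** 2).sum(-1, keepdim=True)     # (N, N, 1) squared distances (invariant)
        pair = torch.cat([h[:, None].expand(-1, h.size(0), -1),
                          h[None, :].expand(h.size(0), -1, -1),
                          dist2], dim=-1)
        m = self.edge_mlp(pair)                       # messages built from invariants only
        x_new = x + (diff * self.coord_mlp(m)).mean(dim=1)  # equivariant coordinate update
        h_new = h + self.node_mlp(torch.cat([h, m.sum(dim=1)], dim=-1))
        return h_new, x_new

# Quick check: rotating the input rotates the output coordinates the same way
layer = EquivariantLayerSketch(dim=16)
h, x = torch.randn(5, 16), torch.randn(5, 3)
Q, _ = torch.linalg.qr(torch.randn(3, 3))             # a random orthogonal matrix
h_rot, x_rot = layer(h, x @ Q.T)
h_ref, x_ref = layer(h, x)
print(torch.allclose(x_rot, x_ref @ Q.T, atol=1e-5), torch.allclose(h_rot, h_ref, atol=1e-5))
```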

Which Representation for Which Task?

The optimal representation depends on the task. For property prediction with limited data, simple fingerprints with classical ML models often perform well. For property prediction with larger datasets, GNNs on 2D molecular graphs typically outperform fingerprints. For molecular generation, SMILES-based models and SELFIES are widely used for their compatibility with language modeling architectures. For binding pose prediction and structure-based design, 3D coordinate representations with SE(3)-equivariant networks are necessary. For protein-ligand interaction modeling, representations that jointly encode the protein and ligand in 3D space are increasingly used. In practice, many state-of-the-art models combine multiple representations — for example, using a 2D GNN for molecular encoding alongside 3D coordinates for docking.
