AI for Target Identification
Finding the right protein to drug — how machine learning mines omics data for novel targets
Before you can design a drug, you need to know what to drug. Target identification (determining which protein, gene, or pathway to modulate in order to treat a disease) is the first and arguably most consequential step in drug discovery. If you choose the wrong target, every subsequent step is wasted. Historically, a substantial fraction of clinical trial failures has been attributed to lack of efficacy, which often traces back to an insufficiently validated target.
What Target Identification Involves
A drug target is typically a protein whose activity contributes to a disease when dysregulated. The ideal target is causally involved in the disease (not merely correlated), is "druggable" (has a binding site amenable to small molecules or biologics), and can be modulated without unacceptable side effects. Identifying such targets requires integrating evidence from genetics, molecular biology, clinical data, and increasingly, large-scale omics datasets.
Traditional target identification relied heavily on academic research, literature mining, and hypothesis-driven experiments. A researcher might study a specific signaling pathway for years before proposing it as a drug target. This approach is deep but narrow and inherently biased toward well-studied biology.
Multi-Omics Approaches
The explosion of biological data, collectively termed "omics", has created new opportunities for systematic target discovery. Genomic data, particularly from genome-wide association studies (GWAS), reveal genetic variants statistically associated with disease risk. If a loss-of-function variant in a gene protects against a disease, that gene's protein product is a compelling target for an inhibitory drug, because the variant mimics in humans what the drug is meant to do. This logic underpinned the development of PCSK9 inhibitors for cardiovascular disease: people who carry loss-of-function PCSK9 variants have lower LDL cholesterol and a reduced risk of coronary heart disease.
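To make the "protective variant" logic concrete, here is a minimal sketch of how an association between an allele and disease status can be summarized as an odds ratio with a confidence interval. The allele counts are invented for illustration, the function name is hypothetical, and a real GWAS would use regression models with covariates and genome-wide multiple-testing thresholds rather than a single 2x2 table.

```python
import math

def allelic_odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds ratio and 95% CI for the alternate allele in cases vs controls.

    Uses Woolf's approximation for the standard error of log(OR).
    """
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    se_log_or = math.sqrt(
        1 / case_alt + 1 / case_ref + 1 / control_alt + 1 / control_ref
    )
    log_or = math.log(odds_ratio)
    ci_low = math.exp(log_or - 1.96 * se_log_or)
    ci_high = math.exp(log_or + 1.96 * se_log_or)
    return odds_ratio, ci_low, ci_high

# Made-up allele counts for a hypothetical loss-of-function variant.
odds_ratio, ci_low, ci_high = allelic_odds_ratio(
    case_alt=30, case_ref=1970, control_alt=80, control_ref=1920
)

# An odds ratio below 1 whose interval excludes 1 suggests a protective allele.
is_protective = ci_high < 1.0
print(f"OR={odds_ratio:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f}), protective={is_protective}")
```

An odds ratio well below 1 for a loss-of-function allele is the pattern that would point toward inhibiting the gene product, although association alone does not establish that the gene is causal.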
Transcriptomics (gene expression data from RNA sequencing) reveals which genes are abnormally active or suppressed in disease states. By comparing gene expression in diseased versus healthy tissue, researchers can flag genes that are consistently up- or down-regulated as candidate targets. Single-cell RNA sequencing adds resolution, showing which cell types within a tissue express a target.
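The diseased-versus-healthy comparison can be sketched as a per-gene log2 fold change plus a test statistic. The example below is a toy illustration using only the Python standard library: the gene names and expression values are invented, and the fold-change cutoff of 1.0 is an arbitrary assumption. Production analyses typically rely on dedicated tools such as DESeq2, edgeR, or limma, which model count data and correct for multiple testing.

```python
import math
from statistics import mean, variance

def log2_fold_change(disease, healthy, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean(disease) + pseudocount) / (mean(healthy) + pseudocount))

def welch_t(disease, healthy):
    """Welch's t statistic, which does not assume equal variances."""
    se = math.sqrt(variance(disease) / len(disease) + variance(healthy) / len(healthy))
    return (mean(disease) - mean(healthy)) / se

# Hypothetical normalized expression values: (diseased samples, healthy samples).
expression = {
    "GENE_A": ([120.0, 135.0, 128.0, 140.0], [30.0, 28.0, 35.0, 31.0]),
    "GENE_B": ([50.0, 52.0, 47.0, 51.0], [49.0, 53.0, 48.0, 50.0]),
    "GENE_C": ([10.0, 12.0, 9.0, 11.0], [40.0, 44.0, 38.0, 42.0]),
}

results = [
    (gene, log2_fold_change(disease, healthy), welch_t(disease, healthy))
    for gene, (disease, healthy) in expression.items()
]

# Rank by the magnitude of change and keep genes with at least a two-fold shift.
results.sort(key=lambda row: abs(row[1]), reverse=True)
differential_genes = [gene for gene, lfc, _ in results if abs(lfc) >= 1.0]

for gene, lfc, t_stat in results:
    print(f"{gene}: log2FC={lfc:+.2f}, t={t_stat:+.2f}")
```

With these made-up values, GENE_A is strongly up-regulated and GENE_C strongly down-regulated, while GENE_B is unchanged and is filtered out.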
Proteomics measures the actual protein levels and modifications in cells and tissues, providing a more direct readout of biology than gene expression alone. Mass spectrometry-based proteomics has matured significantly, and large-scale proteomic datasets are increasingly available.
Integrating these data types — genomics, transcriptomics, proteomics, and others like metabolomics and epigenomics — is a natural fit for machine learning, which can detect patterns across high-dimensional datasets that are invisible to manual analysis.
Knowledge Graphs
Several AI drug discovery companies use knowledge graphs to integrate heterogeneous biological data for target identification. A knowledge graph represents entities (genes, proteins, diseases, drugs, pathways) as nodes and their relationships as edges, drawing from databases like UniProt, STRING, KEGG, and published literature.
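As a rough illustration of the data structure, a knowledge graph can be stored as a list of (subject, relation, object) triples and searched for chains of relationships linking a disease to a drug or target. Everything in this sketch is hypothetical: the entities, the relation names, and the use of a plain breadth-first search, which is far simpler than the graph neural networks and reasoning algorithms used in commercial platforms.

```python
from collections import defaultdict, deque

# Hypothetical triples; a real graph would be built from curated databases
# and relationships extracted from the literature.
triples = [
    ("DiseaseX", "associated_with", "GENE_A"),
    ("GENE_A", "encodes", "ProteinA"),
    ("ProteinA", "interacts_with", "ProteinB"),
    ("DrugY", "inhibits", "ProteinB"),
    ("ProteinA", "participates_in", "PathwayP"),
]

# Adjacency list; inverse edges let the search walk relations in both directions.
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))
    graph[obj].append((f"inverse_{rel}", subj))

def shortest_path(graph, start, goal):
    """Breadth-first search returning a list of (node, relation, node) hops, or None."""
    queue = deque([start])
    parent = {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            hops = []
            while parent[node] is not None:
                prev, rel = parent[node]
                hops.append((prev, rel, node))
                node = prev
            return hops[::-1]
        for rel, neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = (node, rel)
                queue.append(neighbor)
    return None

path = shortest_path(graph, "DiseaseX", "DrugY")
for subj, rel, obj in path:
    print(f"{subj} --{rel}--> {obj}")
```

The returned path (disease to gene to protein to interacting protein to drug) is the kind of indirect connection that a knowledge graph makes searchable. A path like this is only a hypothesis to investigate, not evidence of efficacy.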
BenevolentAI has been a prominent proponent of this approach. Its knowledge graph integrates structured data from biomedical databases with relationships extracted from scientific literature using natural language processing. Graph neural networks and reasoning algorithms traverse this graph to identify novel connections between diseases and potential targets. BenevolentAI used this approach to identify baricitinib (a JAK inhibitor already approved for rheumatoid arthritis) as a potential treatment for COVID-19, a prediction that was subsequently supported by clinical trials.
AI Platforms for Target Discovery
PandaOmics, developed by Insilico Medicine, is a target identification platform that applies deep learning to multi-omics data, text mining, and pathway analysis. It scores potential targets based on multiple lines of evidence — genetic association, expression changes, druggability, novelty, and existing clinical data — and provides ranked target lists for a given disease of interest. Insilico used PandaOmics to identify a novel target for idiopathic pulmonary fibrosis, leading to the development of INS018_055.
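The idea of ranking targets by multiple lines of evidence can be sketched as a weighted sum of per-category scores. This is not PandaOmics's actual scoring method, which the text above does not specify. The category weights, gene names, and scores below are all assumptions chosen purely for illustration.

```python
# Hypothetical weights for each line of evidence (they sum to 1.0).
WEIGHTS = {
    "genetic": 0.35,
    "expression": 0.25,
    "druggability": 0.20,
    "novelty": 0.10,
    "clinical": 0.10,
}

def target_score(evidence, weights=WEIGHTS):
    """Weighted sum of evidence scores in [0, 1]; missing categories count as 0."""
    return sum(weight * evidence.get(name, 0.0) for name, weight in weights.items())

# Invented evidence scores for three placeholder genes.
evidence_by_target = {
    "GENE_A": {"genetic": 0.9, "expression": 0.7, "druggability": 0.8, "novelty": 0.3, "clinical": 0.2},
    "GENE_B": {"genetic": 0.2, "expression": 0.9, "druggability": 0.5, "novelty": 0.9, "clinical": 0.0},
    "GENE_C": {"genetic": 0.6, "expression": 0.4, "druggability": 0.9, "novelty": 0.5},
}

ranked = sorted(
    evidence_by_target,
    key=lambda gene: target_score(evidence_by_target[gene]),
    reverse=True,
)

for gene in ranked:
    print(f"{gene}: {target_score(evidence_by_target[gene]):.3f}")
```

Changing the weights changes the ranking: raising the weight on novelty, for example, would favor GENE_B, which is why the choice of weighting scheme is itself a scientific judgment rather than a neutral default.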
Other companies working on AI-driven target identification include Recursion, which uses phenomics (high-content cellular imaging at scale) to map biological perturbations and infer target-disease relationships, and Exscientia, which combined knowledge graph approaches with patient tissue analysis for target selection.
Causal Inference and Validation
One of the deepest challenges in target identification is distinguishing causal relationships from correlations. A gene may be differentially expressed in a disease without being a good drug target, because the change in expression could be a consequence of the disease rather than a cause. Mendelian randomization, a statistical technique that uses genetic variants as instrumental variables and treats the random allocation of alleles at conception as a natural experiment, can provide evidence for causal relationships between a gene and a disease outcome. Machine learning approaches are increasingly used to automate and scale causal inference from genetic and clinical data.
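A common way to turn this idea into a number is the Wald ratio: for each genetic variant, the variant's effect on the outcome is divided by its effect on the exposure (for example, a protein level), and the per-variant ratios are then combined with inverse-variance weights. The sketch below assumes the variants are valid, mutually independent instruments and uses invented summary statistics. The standard error is a first-order approximation that ignores uncertainty in the exposure effect.

```python
import math

def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal effect estimate from one variant, with an approximate standard error."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)  # ignores the SE of beta_exposure
    return estimate, se

def ivw_estimate(variants):
    """Fixed-effect inverse-variance weighted mean of per-variant Wald ratios."""
    weighted_sum = 0.0
    total_weight = 0.0
    for beta_exposure, beta_outcome, se_outcome in variants:
        estimate, se = wald_ratio(beta_exposure, beta_outcome, se_outcome)
        weight = 1.0 / se**2
        weighted_sum += weight * estimate
        total_weight += weight
    return weighted_sum / total_weight, math.sqrt(1.0 / total_weight)

# Made-up summary statistics: (effect on exposure, effect on outcome, SE of outcome effect).
variants = [
    (0.10, -0.050, 0.010),
    (0.20, -0.090, 0.015),
    (0.15, -0.080, 0.012),
]

effect, effect_se = ivw_estimate(variants)
z_score = effect / effect_se
print(f"Estimated causal effect: {effect:.3f} (SE {effect_se:.3f}, z = {z_score:.1f})")
```

A negative estimate here would suggest that raising the exposure lowers disease risk. The conclusion holds only if the variants affect the outcome solely through the exposure, an assumption that pleiotropy can violate.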
Ultimately, computational target identification must be validated experimentally. CRISPR-based gene knockout and knockdown experiments, patient-derived cell models, and animal disease models are used to confirm that modulating a predicted target has the desired therapeutic effect. The gap between computational prediction and biological validation remains significant, and high-confidence computational targets still have a meaningful failure rate in validation experiments.