[Figure: network diagram with nodes EGFR, JAK2, TP53, KRAS, PI3K, mTOR, RAS; legend: known, inferred]
Target Discovery · Omics

AI for Target Identification

Finding the right protein to drug — how machine learning mines omics data for novel targets

8 min read

Before you can design a drug, you need to know what to drug. Target identification, the task of determining which protein, gene, or pathway to modulate in order to treat a disease, is the first and arguably most consequential step in drug discovery. If you choose the wrong target, every subsequent step is wasted. A substantial fraction of clinical trial failures has historically been attributed to lack of efficacy, which often traces back to an insufficiently validated target.

What Target Identification Involves

A drug target is typically a protein whose activity contributes to a disease when dysregulated. The ideal target is causally involved in the disease (not merely correlated), is "druggable" (has a binding site amenable to small molecules or biologics), and can be modulated without unacceptable side effects. Identifying such targets requires integrating evidence from genetics, molecular biology, clinical data, and increasingly, large-scale omics datasets.

Traditional target identification relied heavily on academic research, literature mining, and hypothesis-driven experiments. A researcher might study a specific signaling pathway for years before proposing it as a drug target. This approach is deep but narrow and inherently biased toward well-studied biology.

Multi-Omics Approaches

The explosion of large-scale biological data, collectively termed "omics," has created new opportunities for systematic target discovery. Genomic data, particularly from genome-wide association studies (GWAS), identify genetic variants statistically associated with disease risk. If a loss-of-function variant in a gene protects against a disease, that gene's protein product is a compelling target for inhibition, because a drug that blocks the protein mimics the effect of the protective variant. This logic underpinned the development of PCSK9 inhibitors for cardiovascular disease.
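As a minimal sketch of the association logic described above, the snippet below computes an odds ratio with a 95% confidence interval from a 2x2 table of variant carriers and non-carriers among cases and controls. The counts are invented for illustration, the function name is hypothetical, and real GWAS pipelines fit regression models with covariates rather than testing a single table.

```python
import math

def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Return (odds ratio, lower 95% CI, upper 95% CI) using Woolf's log method."""
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this simple estimator")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, lower, upper

# Hypothetical counts: loss-of-function carriers are rarer among cases than controls
estimate, lower, upper = odds_ratio(20, 4980, 60, 4940)

# An interval lying entirely below 1 is consistent with a protective variant
looks_protective = upper < 1.0
print(f"OR={estimate:.2f}, 95% CI=({lower:.2f}, {upper:.2f}), protective={looks_protective}")
```

An odds ratio well below 1 for a loss-of-function variant is the kind of signal that suggests inhibiting the gene product might be beneficial, although a single association does not establish causality.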

Transcriptomics (gene expression data from RNA sequencing) reveals which genes are abnormally active in disease states. By comparing gene expression in diseased versus healthy tissue, researchers can flag genes that are consistently over- or under-expressed as candidate targets. Single-cell RNA sequencing adds resolution, showing which cell types within a tissue express a target.
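The diseased-versus-healthy comparison can be sketched as a log2 fold change plus a Welch t-statistic per gene. The gene names, expression values, and thresholds below are all hypothetical, and production analyses typically rely on dedicated differential-expression tools with multiple-testing correction rather than hand-rolled statistics like these.

```python
import math
from statistics import mean, variance

def log2_fold_change(disease, healthy, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean(disease) + pseudocount) / (mean(healthy) + pseudocount))

def welch_t(x, y):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))

# Hypothetical normalized expression values (four samples per group)
expression = {
    "GENE_A": {"disease": [30.0, 34.0, 29.0, 35.0], "healthy": [7.0, 9.0, 8.0, 8.0]},
    "GENE_B": {"disease": [12.0, 10.0, 11.0, 13.0], "healthy": [11.0, 12.0, 10.0, 12.0]},
}

results = {
    gene: (
        log2_fold_change(groups["disease"], groups["healthy"]),
        welch_t(groups["disease"], groups["healthy"]),
    )
    for gene, groups in expression.items()
}

# Arbitrary illustrative cutoffs: at least a 2-fold change and a large |t|
candidates = [g for g, (lfc, t) in results.items() if abs(lfc) >= 1.0 and abs(t) >= 4.0]
print(candidates)
```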

Proteomics measures the actual protein levels and modifications in cells and tissues, providing a more direct readout of biology than gene expression alone. Mass spectrometry-based proteomics has matured significantly, and large-scale proteomic datasets are increasingly available.

Integrating these data types — genomics, transcriptomics, proteomics, and others like metabolomics and epigenomics — is a natural fit for machine learning, which can detect patterns across high-dimensional datasets that are invisible to manual analysis.

Knowledge Graphs

Several AI drug discovery companies use knowledge graphs to integrate heterogeneous biological data for target identification. A knowledge graph represents entities (genes, proteins, diseases, drugs, pathways) as nodes and their relationships as edges, drawing from databases like UniProt, STRING, KEGG, and published literature.
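To make the node-and-edge representation concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples. The entity names, relation labels, and the `KnowledgeGraph` class are placeholders invented for this example; they are not drawn from UniProt, STRING, KEGG, or any company's schema.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy triple store indexed by subject and by object."""

    def __init__(self):
        self.outgoing = defaultdict(list)  # subject -> [(relation, object)]
        self.incoming = defaultdict(list)  # object  -> [(relation, subject)]

    def add(self, subject, relation, obj):
        self.outgoing[subject].append((relation, obj))
        self.incoming[obj].append((relation, subject))

    def neighbors(self, node):
        """Return the set of nodes linked to `node` in either direction."""
        linked = {obj for _, obj in self.outgoing.get(node, [])}
        linked |= {subj for _, subj in self.incoming.get(node, [])}
        return linked

# Hypothetical triples standing in for curated database and literature facts
kg = KnowledgeGraph()
for triple in [
    ("GENE_A", "encodes", "PROTEIN_A"),
    ("PROTEIN_A", "participates_in", "PATHWAY_1"),
    ("PATHWAY_1", "implicated_in", "DISEASE_X"),
    ("DRUG_1", "inhibits", "PROTEIN_A"),
    ("PROTEIN_B", "interacts_with", "PROTEIN_A"),
]:
    kg.add(*triple)

print(sorted(kg.neighbors("PROTEIN_A")))
```

Indexing edges in both directions keeps lookups cheap whether a query starts from a disease and works back to proteins or starts from a drug and works forward.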

BenevolentAI has been a prominent proponent of this approach. Its knowledge graph integrates structured data from biomedical databases with relationships extracted from scientific literature using natural language processing. Graph neural networks and reasoning algorithms traverse this graph to identify novel connections between diseases and potential targets. BenevolentAI used this approach to identify baricitinib (a JAK inhibitor already approved for rheumatoid arthritis) as a potential treatment for COVID-19, a prediction that was subsequently validated in clinical trials.
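The idea of traversing a graph to surface indirect connections can be illustrated with a plain breadth-first search. This is a deliberately simple stand-in: it is not BenevolentAI's method, which the text describes as using graph neural networks and reasoning algorithms, and every entity name below is a placeholder.

```python
from collections import deque

# Hypothetical undirected links between placeholder entities
edges = [
    ("DRUG_1", "PROTEIN_A"),
    ("PROTEIN_A", "PATHWAY_1"),
    ("PATHWAY_1", "DISEASE_X"),
    ("DRUG_2", "PROTEIN_C"),
    ("DRUG_2", "DISEASE_Y"),
]

def build_adjacency(edge_list):
    """Map each node to the set of nodes it is directly linked to."""
    adjacency = {}
    for left, right in edge_list:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sort neighbors so the traversal order is deterministic
        for nxt in sorted(adjacency.get(path[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

adjacency = build_adjacency(edges)
inferred = shortest_path(adjacency, "DRUG_1", "DISEASE_X")
unlinked = shortest_path(adjacency, "DRUG_2", "DISEASE_X")
print(inferred, unlinked)
```

A multi-hop path such as drug to protein to pathway to disease is only a hypothesis. Learned models add value by ranking which of the many possible paths are biologically plausible.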

AI Platforms for Target Discovery

PandaOmics, developed by Insilico Medicine, is a target identification platform that applies deep learning to multi-omics data, text mining, and pathway analysis. It scores potential targets based on multiple lines of evidence — genetic association, expression changes, druggability, novelty, and existing clinical data — and provides ranked target lists for a given disease of interest. Insilico used PandaOmics to identify a novel target for idiopathic pulmonary fibrosis, leading to the development of INS018_055.
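Ranking targets on several lines of evidence can be sketched as a weighted average of per-criterion scores. This is an illustrative assumption about how such a ranking might work, not PandaOmics's actual scoring model; the weights, target names, and evidence values are all invented.

```python
# Hypothetical weights for each evidence type named in the text (they sum to 1.0)
WEIGHTS = {
    "genetic_association": 0.3,
    "expression_change": 0.2,
    "druggability": 0.2,
    "novelty": 0.1,
    "clinical_evidence": 0.2,
}

# Hypothetical evidence scores, each already scaled to the range [0, 1]
candidates = {
    "TARGET_1": {"genetic_association": 0.9, "expression_change": 0.7,
                 "druggability": 0.8, "novelty": 0.2, "clinical_evidence": 0.6},
    "TARGET_2": {"genetic_association": 0.4, "expression_change": 0.9,
                 "druggability": 0.5, "novelty": 0.9, "clinical_evidence": 0.1},
    "TARGET_3": {"genetic_association": 0.2, "expression_change": 0.3,
                 "druggability": 0.9, "novelty": 0.5, "clinical_evidence": 0.0},
}

def composite_score(evidence, weights=WEIGHTS):
    """Weighted average of evidence scores; missing evidence counts as 0."""
    total_weight = sum(weights.values())
    return sum(w * evidence.get(name, 0.0) for name, w in weights.items()) / total_weight

scores = {target: composite_score(evidence) for target, evidence in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for target in ranked:
    print(f"{target}: {scores[target]:.2f}")
```

Changing the weights changes the ranking: raising the weight on novelty, for example, favors less-studied targets at the cost of weaker supporting evidence.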

Other companies working on AI-driven target identification include Recursion, which uses phenomics (high-content cellular imaging at scale) to map biological perturbations and infer target-disease relationships, and Exscientia, which combined knowledge graph approaches with patient tissue analysis for target selection.

Causal Inference and Validation

One of the deepest challenges in target identification is distinguishing causal relationships from correlations. A gene may be differentially expressed in a disease without being a good drug target: the change could be a consequence of the disease rather than a cause. Mendelian randomization, a statistical technique that uses genetic variants as natural experiments, can provide evidence for causal relationships between a gene and a disease outcome. Machine learning approaches are increasingly used to automate and scale causal inference from genetic and clinical data.
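A minimal sketch of two-sample Mendelian randomization follows, assuming summary statistics are available for each genetic variant: its effect on the exposure (for example, the level of a protein) and its effect on the disease outcome. The numbers are invented, and the sketch ignores pleiotropy, weak instruments, and correlated variants, all of which real analyses must address.

```python
import math

# Hypothetical per-variant summary statistics:
# (effect on exposure, effect on outcome, standard error of the outcome effect)
variants = [
    (0.10, 0.050, 0.010),
    (0.20, 0.090, 0.015),
    (0.15, 0.080, 0.012),
]

def wald_ratio(beta_exposure, beta_outcome):
    """Causal effect implied by one variant: outcome effect per unit of exposure effect."""
    return beta_outcome / beta_exposure

def ivw_estimate(stats):
    """Fixed-effect inverse-variance weighted estimate and its standard error."""
    numerator = sum(bx * by / se ** 2 for bx, by, se in stats)
    denominator = sum(bx ** 2 / se ** 2 for bx, by, se in stats)
    return numerator / denominator, math.sqrt(1.0 / denominator)

ratios = [wald_ratio(bx, by) for bx, by, _ in variants]
estimate, std_error = ivw_estimate(variants)

print("per-variant ratios:", [round(r, 3) for r in ratios])
print(f"IVW estimate: {estimate:.3f} (SE {std_error:.3f})")
```

When independent variants give similar ratios, the pooled estimate is more credible as a causal effect; widely scattered ratios hint that some variants influence the outcome through other routes.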

Ultimately, computational target identification must be validated experimentally. CRISPR-based gene knockout and knockdown experiments, patient-derived cell models, and animal disease models are used to confirm that modulating a predicted target has the desired therapeutic effect. The gap between computational prediction and biological validation remains significant, and high-confidence computational targets still have a meaningful failure rate in validation experiments.
