Drug-Target Interaction Prediction Task Overview
Definition: The activity of a small-molecule drug is measured by its binding affinity to the target protein. Given a new target protein, the first step is to screen a set of candidate compounds for activity. Traditionally, affinities are measured through high-throughput screening in wet-lab experiments. These experiments are expensive, which limits how many candidates they can cover. The drug-target interaction (DTI) prediction task aims to predict the interaction activity score in silico, given only the compound's structural information and the protein's amino acid sequence.
Impact: Machine learning models that accurately predict affinities can not only reduce pharmaceutical research costs by cutting the amount of high-throughput screening required, but also enlarge the search space and avoid missing potential candidates.
Generalization: Models must extrapolate to unseen compounds, unseen proteins, and unseen compound-protein pairs. They are also expected to perform consistently across a diverse set of disease and target groups.
Product: Small-molecule.
Pipeline: Activity - hit identification.
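The generalization requirement above (unseen compounds or unseen proteins) is what a cold split evaluates. The following is a minimal sketch of the idea in plain Python, not TDC's implementation: the function name `cold_split`, the record layout, and the 70/10/20 fractions are all assumptions made for illustration.

```python
import random

def cold_split(pairs, key, frac=(0.7, 0.1, 0.2), seed=42):
    """Split records so that no `key` entity (e.g. a drug) appears in two folds."""
    # Partition the unique entities, not the individual pairs.
    entities = sorted({p[key] for p in pairs})
    random.Random(seed).shuffle(entities)
    n_train = round(frac[0] * len(entities))
    n_valid = round(frac[1] * len(entities))

    fold_of = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            fold_of[entity] = "train"
        elif i < n_train + n_valid:
            fold_of[entity] = "valid"
        else:
            fold_of[entity] = "test"

    # Each pair follows the fold of its entity.
    split = {"train": [], "valid": [], "test": []}
    for p in pairs:
        split[fold_of[p[key]]].append(p)
    return split

# Toy records; the identifiers are made up for illustration.
pairs = [{"Drug": f"D{i % 10}", "Target": f"T{i % 4}", "Y": float(i)} for i in range(40)]
split = cold_split(pairs, key="Drug")
train_drugs = {p["Drug"] for p in split["train"]}
test_drugs = {p["Drug"] for p in split["test"]}
```

Passing `key="Target"` instead would give the cold protein variant. A random split, by contrast, shuffles the pairs themselves, so the same drug or protein can appear in both training and test data.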
BindingDB
Dataset Description: BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules.
Task Description: Regression. Given the target amino acid sequence and the compound SMILES string, predict their binding affinity.
Dataset Statistics: (# of DTI pairs, # of drugs, # of proteins) 52,284/10,665/1,413 for Kd, 991,486/549,205/5,078 for IC50, 375,032/174,662/3,070 for Ki.
Dataset Split: Random Split, Cold Drug Split, Cold Protein Split.
Note: BindingDB is a collection of many assays. Because different assays report different affinity measures, TDC provides them as separate datasets, one each for Kd, IC50, and Ki.
Tips: Transforming the labels to log scale (pIC50, pKi, or pKd) usually leads to more stable training. TDC provides this transformation among its data processing functions. Check out the data processing page for binarization, label distribution visualization, and edge list/DGL/PyTorch graph transformation.
from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd')
# data = DTI(name = 'BindingDB_IC50')
# data = DTI(name = 'BindingDB_Ki')
split = data.get_split()
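The log-scale transformation mentioned in the tips can be sketched in plain Python. This is an illustration rather than TDC's own routine: it assumes the raw labels are in nanomolar (nM), which should be checked against the loaded data, and the helper names `to_p_scale` and `from_p_scale` are hypothetical.

```python
import math

def to_p_scale(value_nM):
    """Convert an affinity in nM (Kd, Ki, or IC50) to its negative log10 molar value."""
    if value_nM <= 0:
        raise ValueError("affinity must be positive to take a logarithm")
    # -log10(value_nM * 1e-9) simplifies to 9 - log10(value_nM).
    return 9.0 - math.log10(value_nM)

def from_p_scale(p_value):
    """Invert the transformation, returning the affinity in nM."""
    return 10 ** (9.0 - p_value)

# Raw values spanning four orders of magnitude collapse to a narrow range.
kd_nM = [1.0, 10.0, 1000.0, 10000.0]
pkd = [to_p_scale(v) for v in kd_nM]
```

On the p-scale a larger number means stronger binding, and each unit corresponds to a tenfold change in the raw value, which is why the regression target becomes better behaved.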
Note: Many DTI pairs have the same sequence information but different binding affinities because they come from different experimental assays. To harmonize them, use the function below to keep either the maximum affinity or the mean affinity for each duplicated pair:
from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd')
data.harmonize_affinities(mode = 'max_affinity')
# data.harmonize_affinities(mode = 'mean')
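To make the effect of harmonization concrete, here is a minimal sketch of the grouping logic in plain Python. It is not TDC's implementation: the `harmonize` helper and the toy records are hypothetical, and mapping "max_affinity" to the smallest raw value is an assumption based on the fact that a lower Kd means stronger binding.

```python
from collections import defaultdict
from statistics import mean

def harmonize(records, mode="mean"):
    """Collapse duplicated (drug, target) pairs into a single label each."""
    groups = defaultdict(list)
    for drug, target, y in records:
        groups[(drug, target)].append(y)

    if mode == "mean":
        aggregate = mean
    elif mode == "max_affinity":
        # Assumption: labels are raw Kd values, so the strongest binder has the lowest value.
        aggregate = min
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {pair: aggregate(ys) for pair, ys in groups.items()}

# Toy records: the first two rows are the same pair measured in two assays.
records = [("CCO", "MKV", 10.0), ("CCO", "MKV", 30.0), ("CCN", "MKV", 5.0)]
by_mean = harmonize(records, mode="mean")
by_max_affinity = harmonize(records, mode="max_affinity")
```

If the labels have already been converted to a p-scale, the direction flips and the strongest measurement is the largest value, so the aggregation would need to use `max` instead.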
Dataset License: CC BY 3.0 US.
DAVIS
Dataset Description: The interaction of 72 kinase inhibitors with 442 kinases covering >80% of the human catalytic protein kinome.
Task Description: Regression. Given the target amino acid sequence and the compound SMILES string, predict their binding affinity.
Dataset Statistics: After the 0.3.2 update: 25,772 DTI pairs, 68 drugs, 379 proteins. Before the update: 27,621 DTI pairs, 68 drugs, 379 proteins.
Dataset Split: Random Split, Cold Drug Split, Cold Protein Split.
Tips: Transforming the labels to log scale (pIC50, pKi, or pKd) usually leads to more stable training. TDC provides this transformation among its data processing functions. Check out the data processing page for binarization, label distribution visualization, and edge list/DGL/PyTorch graph transformation.
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
split = data.get_split()
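The binarization mentioned in the tips turns the regression label into an active/inactive class. The sketch below shows the idea in plain Python and is not TDC's own function: the `binarize` helper, the 30 nM cutoff, and the assumption that labels are raw Kd values in nM are all illustrative choices.

```python
def binarize(values, threshold, lower_is_active=True):
    """Label each affinity 1 (active) or 0 (inactive) relative to a threshold."""
    if lower_is_active:
        # Raw Kd, Ki, or IC50: smaller values mean stronger binding.
        return [1 if v <= threshold else 0 for v in values]
    # p-scale labels: larger values mean stronger binding.
    return [1 if v >= threshold else 0 for v in values]

# Hypothetical Kd values in nM with an illustrative cutoff.
kd_nM = [0.5, 30.0, 250.0, 10000.0]
labels = binarize(kd_nM, threshold=30.0)
```

The choice of threshold determines the class balance, so it is worth inspecting the label distribution before fixing a cutoff.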
Dataset License: Not Specified. CC BY 4.0.
KIBA
Dataset Description: To make use of the complementary information captured by the various bioactivity types, including IC50, Ki, and Kd, Tang et al. introduce a model-based integration approach, termed KIBA, to generate an integrated drug-target bioactivity matrix.
Task Description: Regression. Given the target amino acid sequence and the compound SMILES string, predict their binding affinity.
Dataset Statistics: After the 0.3.2 update: 117,657 DTI pairs, 2,068 drugs, 229 proteins. Before the update: 118,036 DTI pairs, 2,068 drugs, 229 proteins.
Dataset Split: Random Split, Cold Drug Split, Cold Protein Split.
Tips: Check out the data processing page for binarization, label distribution visualization, and edge list/DGL/PyTorch graph transformation.
from tdc.multi_pred import DTI
data = DTI(name = 'KIBA')
split = data.get_split()
Dataset License: Not Specified. CC BY 4.0.