Drug Response Prediction Task Overview
Definition: The same drug compound can elicit different levels of response in different patients. Designing drugs for an individual, or for a group with certain characteristics, is the central goal of precision medicine. For example, the same anti-cancer drug can produce different responses in different cancer cell lines. This task aims to predict the drug response level given a drug and the genomic profile of a cell line.
Impact: The number of combinations of available drugs and cell line genomic profiles is very large, and testing each combination in the wet lab is prohibitively expensive. A machine learning model that accurately predicts a drug's response across cell lines in silico can make the combination search feasible and greatly reduce the experimental burden. Fast prediction also makes it possible to screen a large set of drugs, reducing the risk of missing potent candidates.
Generalization: A model trained on existing drug and cell line pairs should predict accurately on new drugs and new cell lines. This requires the model to learn the underlying biochemistry rather than memorize the training pairs.
Product: Small-molecule.
Pipeline: Activity.
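One way to probe the generalization requirement above is to hold out entire drugs, so that no test drug is seen during training. The following is a minimal sketch using only the Python standard library; the function name cold_drug_split, the record layout (drug, cell line, response), and the toy values are illustrative assumptions, not part of the dataset or its loader.

```python
import random

def cold_drug_split(pairs, test_frac=0.2, seed=0):
    """Split (drug, cell_line, y) records so that test drugs never appear in training.

    Hypothetical helper for illustration; not part of any library.
    """
    drugs = sorted({drug for drug, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    # Hold out at least one drug, even for very small inputs.
    n_test = max(1, round(len(drugs) * test_frac))
    test_drugs = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in test_drugs]
    test = [p for p in pairs if p[0] in test_drugs]
    return train, test

# Toy records: (drug SMILES, cell line ID, response). All values are made up.
pairs = [
    ("CCO", "cell_A", 1.2), ("CCO", "cell_B", 0.7),
    ("CCN", "cell_A", 2.1), ("CCN", "cell_B", 1.9),
    ("CCC", "cell_A", 0.3), ("CCC", "cell_B", 0.5),
    ("c1ccccc1", "cell_A", 3.0), ("c1ccccc1", "cell_B", 2.8),
    ("CC(=O)O", "cell_A", 1.0), ("CC(=O)O", "cell_B", 1.1),
]
train, test = cold_drug_split(pairs, test_frac=0.2, seed=0)
```

The same idea applies to cell lines by grouping on the second field instead of the first. A model that scores well on a random split but poorly on a split like this is likely memorizing training pairs.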
GDSC1
Dataset Description: Genomics of Drug Sensitivity in Cancer (GDSC) is a resource for therapeutic biomarker discovery in cancer cells. It contains wet-lab IC50 measurements for hundreds of drugs across roughly 1,000 cancer cell lines. This dataset uses RMA-normalized gene expression for the cancer cell lines and SMILES strings for the drugs. Y is the log-normalized IC50. This is version 1 of GDSC.
Task Description: Regression. Given the gene expression profile of a cell line and the SMILES string of a drug, predict the drug sensitivity level.
Dataset Statistics: 177,310 pairs, 958 cancer cell lines, and 208 drugs.
Dataset Split: Random Split
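from tdc.multi_pred import DrugRes
# Load the GDSC1 drug response dataset
data = DrugRes(name='GDSC1')
# Default random split; returns a dict with 'train', 'valid', and 'test' sets
split = data.get_split()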
To get the gene names corresponding to the cell line expression values, use:
data.get_gene_symbols()
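As an illustration of how the gene names might be used, the sketch below pairs a list of gene symbols with one expression vector so that values can be looked up by gene. It assumes that data.get_gene_symbols() returns names aligned index-by-index with each cell line's expression vector; that alignment, the helper name expression_by_gene, and the stand-in values are assumptions made for this example and should be checked against the loaded data.

```python
def expression_by_gene(gene_symbols, expression):
    """Map each gene symbol to its expression value.

    Hypothetical helper; assumes both sequences share the same order.
    """
    if len(gene_symbols) != len(expression):
        raise ValueError("gene symbols and expression vector differ in length")
    return dict(zip(gene_symbols, expression))

# Stand-in values for illustration only; real inputs would come from
# data.get_gene_symbols() and one cell line's expression vector.
gene_symbols = ["TP53", "EGFR", "KRAS"]
expression = [7.1, 5.4, 6.3]

profile = expression_by_gene(gene_symbols, expression)
# Gene with the highest expression value in this profile.
top_gene = max(profile, key=profile.get)
```

The length check is a cheap guard against silently misaligned features, which is an easy mistake when expression vectors are filtered or reordered before modeling.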
Dataset License: CC BY-NC-ND 2.5.
GDSC2
Dataset Description: Genomics of Drug Sensitivity in Cancer (GDSC) is a resource for therapeutic biomarker discovery in cancer cells. It contains wet-lab IC50 measurements for hundreds of drugs across roughly 1,000 cancer cell lines. This dataset uses RMA-normalized gene expression for the cancer cell lines and SMILES strings for the drugs. Y is the log-normalized IC50. This is version 2 of GDSC, which uses improved experimental procedures.
Task Description: Regression. Given the gene expression profile of a cell line and the SMILES string of a drug, predict the drug sensitivity level.
Dataset Statistics: 92,703 pairs, 805 cancer cell lines, and 137 drugs.
Dataset Split: Random Split
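from tdc.multi_pred import DrugRes
# Load the GDSC2 drug response dataset
data = DrugRes(name='GDSC2')
# Default random split; returns a dict with 'train', 'valid', and 'test' sets
split = data.get_split()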
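Because the task is regression, predictions on the held-out portion of the split can be scored with standard regression metrics. The source does not prescribe a metric; root-mean-square error and Pearson correlation are assumed here as common choices, and the function names and toy values below are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted responses."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Made-up values standing in for log-normalized IC50 labels and model outputs.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

error = rmse(y_true, y_pred)
correlation = pearson(y_true, y_pred)
```

In this toy case every prediction is offset by the same amount, so the correlation is perfect while the error is not zero. Reporting both metrics helps separate ranking quality from calibration.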
Dataset License: CC BY-NC-ND 2.5.