Biology Datasets

Benchmark datasets for drug discovery, protein science, and computational biology.

Drug Discovery

| Dataset | Task | Size | Link |
| --- | --- | --- | --- |
| TDC | Drug discovery benchmarks | Various | tdcommons.ai |
| MoleculeNet | ADMET, toxicity | 700K+ molecules | moleculenet.ai |
| ChEMBL | Bioactivity data | 2M+ compounds | ebi.ac.uk/chembl |
| BindingDB | Drug-target binding data | 2.6M+ entries | bindingdb.org |
| PDBbind | Protein-ligand affinity | 23K+ complexes | pdbbind.org.cn |
| DrugBank | Approved and investigational drugs | 15K+ drugs | drugbank.com |
| ZINC | Virtual screening compounds | 250M+ | zinc.docking.org |

Proteins

| Dataset | Task | Size | Link |
| --- | --- | --- | --- |
| AlphaFold DB | Predicted protein structures | 200M+ structures | alphafold.ebi.ac.uk |
| PDB | Experimental structures | 200K+ | rcsb.org |
| UniProt | Protein sequences & functions | 250M+ | uniprot.org |
| CATH | Protein domain classification | - | cathdb.info |
| Human Protein Atlas | Protein expression data | - | proteinatlas.org |
| Uniclust | Clustered protein sequences | - | uniclust.mmseqs.com |

Single-Cell & Genomics

| Dataset | Description | Link |
| --- | --- | --- |
| Berkeley Drosophila Genome Project | Drosophila genome annotations and data | ml4sci |
| Gene Expression Omnibus | Public functional genomics data | ncbi.nlm.nih.gov/geo |
| Single Cell Portal | Single cell RNA data | singlecell.broadinstitute.org |
| Single Cell Expression Atlas | scRNA atlas | ebi.ac.uk/gxa/sc |
| 10x Genomics Datasets | Single-cell datasets | 10xgenomics.com |
| GTEx | Gene expression and regulation | gtexportal.org |
| DepMap | CRISPR screens in cancer cells | depmap.org |
| Open Problems | Standardized benchmarks for single-cell tasks (batch integration, denoising, label projection, etc.) | openproblems.bio |

Drug Response & Screening

| Dataset | Description | Link |
| --- | --- | --- |
| Open MSI | Mass spectrometry imaging data and analysis | ml4sci |
| NCI60 | Cancer cell line screening | dtp.cancer.gov |
| GDSC | Genomics of Drug Sensitivity | cancerrxgene.org |
| CCLE | Cancer Cell Line Encyclopedia | broadinstitute.org/ccle |
| CellMinerCDB | Integrated cell line databases | discover.nci.nih.gov |

Pathways & Interactions

| Dataset | Description | Link |
| --- | --- | --- |
| KEGG PATHWAY | Biological pathway maps | genome.jp/kegg |
| WikiPathways | Community pathway database | wikipathways.org |
| PathwayCommons | Pathway and interaction data | pathwaycommons.org |
| STRING | Protein-protein interactions | string-db.org |
| BioGRID | Interaction database | thebiogrid.org |
| STITCH | Chemical-protein interactions | stitch.embl.de |

Disease & Clinical

| Dataset | Description | Link |
| --- | --- | --- |
| COSMIC | Somatic mutations in cancer | cancer.sanger.ac.uk |
| cBioPortal | Cancer genomics portal | cbioportal.org |
| ClinicalTrials.gov | Clinical study database | clinicaltrials.gov |
| DisGeNET | Gene-disease associations | disgenet.com |
| MIMIC-IV | Critical care database | mimic.mit.edu |

Knowledge Graphs

| Dataset | Description | Link |
| --- | --- | --- |
| DRKG | Drug Repurposing Knowledge Graph | github |
| DrugMechDB | Drug mechanism database | github |
| CTD | Chemical-gene-disease associations | ctdbase.org |

Awesome Lists

| Resource | Description |
| --- | --- |
| awesome-computational-biology | Comprehensive computational biology resources |
| TDC | Therapeutics Data Commons |
| awesome-small-molecule-ml | Drug discovery datasets |