Biology Datasets
Benchmark datasets for drug discovery, protein science, and computational biology.
Drug Discovery
| Dataset | Task | Size | Link |
|---|---|---|---|
| TDC | Drug discovery benchmarks | Various | tdcommons.ai |
| MoleculeNet | Molecular property prediction (physical chemistry, biophysics, physiology, toxicity) | 700K+ molecules | moleculenet.ai |
| ChEMBL | Bioactivity data | 2M+ compounds | ebi.ac.uk/chembl |
| BindingDB | Drug-target binding data | 2.6M+ entries | bindingdb.org |
| PDBbind | Protein-ligand affinity | 23K+ complexes | pdbbind.org.cn |
| DrugBank | Approved and investigational drugs | 15K+ drugs | drugbank.com |
| ZINC | Virtual screening compounds | 250M+ compounds | zinc.docking.org |
Proteins
| Dataset | Task | Size | Link |
|---|---|---|---|
| AlphaFold DB | Predicted protein structures | 200M+ structures | alphafold.ebi.ac.uk |
| PDB | Experimental structures | 200K+ structures | rcsb.org |
| UniProt | Protein sequences & functions | 250M+ sequences | uniprot.org |
| CATH | Protein domain classification | - | cathdb.info |
| Human Protein Atlas | Protein expression data | - | proteinatlas.org |
| Uniclust | Clustered protein sequences | - | uniclust.mmseqs.com |
Single-Cell & Genomics
| Dataset | Description | Link |
|---|---|---|
| Berkeley Drosophila Genome Project | Drosophila genome annotations and data | ml4sci |
| Gene Expression Omnibus | Public functional genomics data | ncbi.nlm.nih.gov/geo |
| Single Cell Portal | Single-cell RNA-seq data | singlecell.broadinstitute.org |
| Single Cell Expression Atlas | Single-cell RNA-seq expression atlas | ebi.ac.uk/gxa/sc |
| 10x Genomics Datasets | Single-cell datasets | 10xgenomics.com |
| GTEx | Gene expression and regulation | gtexportal.org |
| DepMap | Cancer dependency data from CRISPR and RNAi screens in cancer cell lines | depmap.org |
| Open Problems | Standardized benchmarks for single-cell tasks (batch integration, denoising, label projection, etc.) | openproblems.bio |
Drug Response & Screening
| Dataset | Description | Link |
|---|---|---|
| Open MSI | Mass spectrometry imaging data and analysis | ml4sci |
| NCI60 | Cancer cell line screening | dtp.cancer.gov |
| GDSC | Genomics of Drug Sensitivity in Cancer | cancerrxgene.org |
| CCLE | Cancer Cell Line Encyclopedia | broadinstitute.org/ccle |
| CellMinerCDB | Integrated cell line databases | discover.nci.nih.gov |
Pathways & Interactions
| Dataset | Description | Link |
|---|---|---|
| KEGG PATHWAY | Biological pathway maps | genome.jp/kegg |
| WikiPathways | Community pathway database | wikipathways.org |
| PathwayCommons | Pathway and interaction data | pathwaycommons.org |
| STRING | Protein-protein interactions | string-db.org |
| BioGRID | Curated protein, genetic, and chemical interactions | thebiogrid.org |
| STITCH | Chemical-protein interactions | stitch.embl.de |
Disease & Clinical
| Dataset | Description | Link |
|---|---|---|
| COSMIC | Somatic mutations in cancer | cancer.sanger.ac.uk |
| cBioPortal | Cancer genomics portal | cbioportal.org |
| ClinicalTrials.gov | Clinical study database | clinicaltrials.gov |
| DisGeNET | Gene-disease associations | disgenet.com |
| MIMIC-IV | Critical care database | mimic.mit.edu |
Knowledge Graphs
| Dataset | Description | Link |
|---|---|---|
| DRKG | Drug Repurposing Knowledge Graph | github |
| DrugMechDB | Drug mechanism database | github |
| CTD | Chemical-gene-disease associations | ctdbase.org |
Awesome Lists
| Resource | Description |
|---|---|
| awesome-computational-biology | Comprehensive computational biology resources |
| TDC | Therapeutics Data Commons |
| awesome-small-molecule-ml | Drug discovery datasets |