Chemistry Datasets
Benchmark datasets for molecular property prediction, reactions, and more.
Benchmark Suites
| Dataset | Description | Link |
|---|---|---|
| MoleculeNet | Comprehensive benchmark suite with multiple tasks | moleculenet.org |
| TDC | Therapeutics Data Commons — drug discovery datasets | tdcommons.ai |
| OCHEM | 3.77M records for 689 chemical properties (CC BY 4.0) | ochem.eu |
Molecular Structure Databases
| Dataset | Description | Link |
|---|---|---|
| ZINC20 | Chemical library for deep docking virtual screening | zinc20-ML |
| ZINC22 | Commercially-available compounds for virtual screening | cartblanche22 |
| COCONUT | Open-source natural products database | coconut |
| Crystallography Open Database | Open-access crystal structures | cod |
| GDB | Enumerated molecules following feasibility rules | gdb |
| Enamine HTS | 1.93 million diverse screening compounds | enamine |
| GNPS | Mass spectrometry database for natural products | gnps |
| MoNA | Mass spectrometry database with real and predicted spectra | mona |
| nmrshiftdb2 | Organic structures linked with NMR spectral data | nmrshiftdb2 |
| nCov-Group Data | SMILES, fingerprints, descriptors for millions of compounds | ncov |
Property Prediction Benchmarks
| Dataset | Property | Link |
|---|---|---|
| AquaSolDB | Curated aqueous solubility data | harvard dataverse |
| BigSolDB 2.0 | 103,944 solubility values across solvents/temperatures | zenodo |
| ESOL | Water solubility for organic small molecules | pubmed |
| FreeSolv | Experimental and calculated hydration free energies | github |
| Lipophilicity | Octanol/water distribution coefficients at pH 7.4 | deepchem |
| Flashpoint | 10,575 molecular flashpoint values | github |
| Harvard OPV | Photovoltaic data with quantum-chemical calculations | figshare |
| ILThermo | Ionic liquid thermodynamic and transport properties | nist |
| Photoswitch Dataset | 405 curated photoswitch molecules | github |
| Leffingwell Odor | 3,523 molecules with expert-labeled odor descriptors | zenodo |
| Limiting Activity Coefficients | Solvent/solute pair data | polybox |
QM Datasets
| Dataset | Description | Link |
|---|---|---|
| QM7/QM7b | Small organic molecules with properties | quantum-machine.org |
| QM8 | Electronic spectra and excited state energies | quantum-machine.org |
| QM9 | 134K molecules with 13 properties from DFT | quantum-machine.org |
| MD Trajectories | Molecular dynamics simulation data | quantum-machine.org |
| SolProp | 1 million COSMO-RS calculations and solvation data | mit |
| SOMAS | Solubility for redox-flow battery design | figshare |
Bioactivity & Drug Discovery
| Dataset | Description | Link |
|---|---|---|
| BindingDB | 2.6M binding data entries for molecular recognition | bindingdb |
| ChEBI-20 | 33,010 molecule-description pairs for captioning | paperswithcode |
| Papyrus | Curated bioactivity combining ChEMBL and ExCAPE-DB | 4tu |
| LIT-PCBA | 15 target sets, 7,761 actives, 382,674 inactives | drugdesign |
| MPCD | 39 datasets for activity prediction | github |
| MoleculeACE | Benchmark for activity cliff compounds | github |
| ACNet | 400K Matched Molecular Pairs against 190 targets | acnet |
Reaction Datasets
| Dataset | Description | Link |
|---|---|---|
| USPTO | Reactions from US patents 1976-2016 | figshare |
| Open Reaction Database | Standardized open reaction data | ord |
| RDB7 | Atom-mapped SMILES with barrier heights and enthalpies | zenodo |
| Dreher-Doyle | 3,955 Pd-catalyzed Buchwald-Hartwig reactions | github |
| Perera (Suzuki) | 5,760 Pd-catalyzed Suzuki-Miyaura reactions | github |
ADME & Pharmacology
| Dataset | Description | Link |
|---|---|---|
| EPA CompTox | Chemistry, toxicity, exposure for 100K+ chemicals | epa |
| SIDER | Drug side effects (CC BY-NC-SA 4.0) | embl |
| Caco-2 Permeability | Drug absorption through intestinal tissue | figshare |
| PAMPA Permeability | Drug permeability assay data | doi |
| KEGG PATHWAY | Biological functions from molecular-level data | kegg |
| LOTUS | 750,000+ structure-organism pairs | zenodo |
| MetXBioDB | Biotransformation reactions and metabolites | zenodo |
| ONSIDES | Adverse drug effects from FDA labels | github |
| HMDB | Small molecule metabolites in human body | hmdb |
| Guide to PHARMACOLOGY | Expert-curated ligand-activity-target (CC BY-SA 4.0) | iuphar |
| Clinical Trials | All study records from ClinicalTrials.gov | ct.gov |
| KD-DTI | Drug-target-interaction triplets | github |
| Metrabase | Human small molecule metabolism and transport | cam |
| Open Targets | Genetics/genomics for target identification | opentargets |
| Probes & Drugs | Bioactive compound libraries | probes-drugs |
| DDI Dataset | MedLine/DrugBank documents on drug interactions | paperswithcode |
Text & NLP Datasets
| Dataset | Description | Link |
|---|---|---|
| PubMed | Abstracts and citations from biomedical literature | pubmed |
| PubMed Central | Free full-text biomedical article archive | pmc |
| S2ORC | 81.1M English academic papers (CC BY-NC 4.0) | github |
| ChemTables | 788 chemical patent tables with labels (CC BY-NC 3.0) | mendeley |
| NLMChem | 150 manually annotated full-text articles on chemicals | nlm |
| PubChemSTM | 281K chemical structure and text paired data | arxiv |
| Elsevier Corpus | 40,001 open-access CC-BY articles | elsevier |
| Europe PMC | Bulk download of 5M+ full-text articles | europepmc |
| BioRxiv XML | Full-text bioRxiv via Amazon S3 | biorxiv |
| MedRxiv XML | Medical research articles via S3 | medrxiv |
| BC5CDR | 1,500 PubMed articles with annotated chemicals/diseases | paperswithcode |
| PubMedQA | QA dataset with 1K expert labels, 273.5K instances | pubmedqa |
LLM Training Datasets
| Dataset | Description | Size | Link |
|---|---|---|---|
| ChemPile | Mixture-of-expert chemical corpus | 75B+ tokens | huggingface |
| SmolInstruct | Instruction dataset from 15 chemistry tasks | 3.3M pairs | huggingface |
| CAMEL Chemistry | GPT-4 conversation pairs about chemistry | 20K samples | huggingface |
| ChemNLP | Curated text for chemistry LLMs | 80K records | huggingface |
| ChemLLMBench | Chemistry evaluation benchmark for LLMs | 8 tasks | github |
| SciCode | Scientific coding reasoning benchmark | 338 subproblems | github |
| ChemData 700K | Chemistry question-answer pairs | 700K pairs | huggingface |
| MatSci-Instruct | Materials science instruction data | 52K samples | huggingface |
| MoleculeQA | Molecular understanding benchmark | 62K QA pairs | github |
| ChemBench | 7-domain benchmarking across chemistry | 2.7K questions | github |
| MatText | Benchmarking language representations of materials | - | github |
| MegaScience | Multi-domain science question answering | - | huggingface |
| ZINC20-ML | Deep-learning-ready ZINC20 formats | 300M+ | zinc20-ML |
Literature-Mined Datasets
| Dataset | Description | Size | Link |
|---|---|---|---|
| PubChem | Chemical information from 750+ data sources | 116M compounds | pubchem |
| Open Reaction Database | Standardized chemical reaction data | 1M+ reactions | ord |
| PatCID | Chemical structures from patent images | 81M images | github |
| MatScholar | NLP-extracted materials entities | 5M+ abstracts | matscholar |
| L2M3 | Literature-mined materials-morphology database | 51K entries | matscholar |
| ChemDataExtractor | Toolkit for auto-extracting chemical data | - | chemdataextractor |
Reference Resources
| Resource | Description | Link |
|---|---|---|
| IUPAC Gold Book | Chemistry terminology and definitions | iupac |
| LibreTexts Chemistry | Open-access chemistry textbook | libretexts |
| OpenStax Chemistry 2e | Free textbook (CC BY 4.0) | openstax |
| ThermoML Archive | Thermophysical property data | nist |
Dataset Collections
| Resource | Description |
|---|---|
| awesome-matchem-datasets | Materials & chemistry datasets (Blaiszik) |
| awesome-chemistry-datasets | Curated list of ML-ready chemistry datasets |
| TDC | Therapeutics Data Commons — drug discovery datasets |
Data Formats
Common formats you’ll encounter:
| Format | Description | Tool |
|---|---|---|
| SMILES | String representation of molecules | RDKit |
| SDF/MOL | Structure files with 2D or 3D coordinates; SDF can also carry per-molecule data fields | RDKit, OpenBabel |
| InChI | Unique molecular identifier | RDKit |
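To make the SDF row more concrete, here is a minimal sketch of how an SDF file is laid out, written with only the Python standard library. It assumes V2000 connection tables, where the fourth line of each record holds the atom and bond counts, and it relies on the `$$$$` line that separates records. The two sample records, the field names `SMILES` and `ID`, and the helper names `parse_sdf` and `_parse_record` are hand-written for illustration and do not come from any dataset listed above.

```python
import re

# Two hand-written, hypothetical SDF records (V2000 connection tables).
SDF_TEXT = """\
ethanol
  hand-written example

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2000    1.2000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
> <SMILES>
CCO

> <ID>
example-1

$$$$
water
  hand-written example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <SMILES>
O

$$$$
"""


def _parse_record(lines):
    """Extract the title, atom/bond counts, and data fields of one record."""
    counts = lines[3]  # V2000 counts line: atoms in columns 1-3, bonds in 4-6
    fields, name = {}, None
    for line in lines[4:]:
        header = re.match(r">.*<(.+?)>", line)
        if header:
            # A "> <NAME>" line opens a data field.
            name = header.group(1)
            fields[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current field
            else:
                fields[name].append(line)
    return {
        "title": lines[0].strip(),
        "n_atoms": int(counts[0:3]),
        "n_bonds": int(counts[3:6]),
        "fields": {key: "\n".join(value) for key, value in fields.items()},
    }


def parse_sdf(text):
    """Split SDF text on '$$$$' delimiter lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    return records


records = parse_sdf(SDF_TEXT)
for record in records:
    print(record["title"], record["n_atoms"], record["n_bonds"], record["fields"])
```

This sketch is only meant to show the file layout. For real datasets, a chemistry toolkit such as RDKit is the safer choice, since it also validates the chemistry and handles format variants this parser ignores.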
Loading Data with RDKit
```python
from rdkit import Chem
import pandas as pd

# From a SMILES string (returns None if the string cannot be parsed)
mol = Chem.MolFromSmiles("CCO")

# From an SDF file; unreadable records come back as None, so filter them out
suppl = Chem.SDMolSupplier("molecules.sdf")
mols = [m for m in suppl if m is not None]

# From a CSV file with a "smiles" column
df = pd.read_csv("molecules.csv")
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)
```
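Public datasets often contain a few unparsable or duplicated SMILES strings, so a cleaning pass after loading is usually worthwhile. The sketch below is one possible approach, not a prescribed workflow: it uses a small in-memory DataFrame with made-up entries in place of `molecules.csv`, drops rows where `Chem.MolFromSmiles` returned `None`, and removes duplicates by canonical SMILES. The column name `canonical_smiles` is an assumption chosen for this example.

```python
import pandas as pd
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages for the deliberately invalid entry.
RDLogger.DisableLog("rdApp.*")

# Hypothetical stand-in for a CSV file with a "smiles" column.
df = pd.DataFrame({"smiles": ["CCO", "OCC", "c1ccccc1", "not_a_smiles"]})
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)

# Rows that failed to parse hold None in the "mol" column; keep the rest.
valid = df[df["mol"].notna()].copy()

# Canonical SMILES gives one string per structure, so "CCO" and "OCC"
# (both ethanol) collapse to a single row.
valid["canonical_smiles"] = valid["mol"].apply(Chem.MolToSmiles)
valid = valid.drop_duplicates(subset="canonical_smiles")

print(valid[["smiles", "canonical_smiles"]])
```

Deduplicating on canonical SMILES rather than on the raw string matters for benchmarks, because the same molecule written two ways could otherwise end up in both the training and the test split.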