Chemistry Datasets

Benchmark datasets for molecular property prediction, reactions, and more.

Benchmark Suites

| Dataset | Description | Link |
| --- | --- | --- |
| MoleculeNet | Comprehensive benchmark suite with multiple tasks | moleculenet.org |
| TDC | Therapeutics Data Commons: drug discovery datasets | tdcommons.ai |
| OCHEM | 3.77M records for 689 chemical properties (CC BY 4.0) | ochem.eu |

Molecular Structure Databases

| Dataset | Description | Link |
| --- | --- | --- |
| ZINC20 | Chemical library for deep docking virtual screening | zinc20-ML |
| ZINC22 | Commercially available compounds for virtual screening | cartblanche22 |
| COCONUT | Open-source natural products database | coconut |
| Crystallography Open Database | Open-access crystal structures | cod |
| GDB | Enumerated molecules following feasibility rules | gdb |
| Enamine HTS | 1.93 million diverse screening compounds | enamine |
| GNPS | Mass spectrometry database for natural products | gnps |
| MoNA | Mass spectrometry database with real and predicted spectra | mona |
| nmrshiftdb2 | Organic structures linked with NMR spectral data | nmrshiftdb2 |
| nCov-Group Data | SMILES, fingerprints, descriptors for millions of compounds | ncov |

Property Prediction Benchmarks

| Dataset | Property | Link |
| --- | --- | --- |
| AquaSolDB | Curated aqueous solubility data | harvard dataverse |
| BigSolDB 2.0 | 103,944 solubility values across solvents/temperatures | zenodo |
| ESOL | Water solubility for organic small molecules | pubmed |
| FreeSolv | Experimental and calculated hydration free energies | github |
| Lipophilicity | Octanol/water distribution coefficients at pH 7.4 | deepchem |
| Flashpoint | 10,575 molecular flashpoint values | github |
| Harvard OPV | Photovoltaic data with quantum-chemical calculations | figshare |
| ILThermo | Ionic liquid thermodynamic and transport properties | nist |
| Photoswitch Dataset | 405 curated photoswitch molecules | github |
| Leffingwell Odor | 3,523 molecules with expert-labeled odor descriptors | zenodo |
| Limiting Activity Coefficients | Solvent/solute pair data | polybox |

QM Datasets

| Dataset | Description | Link |
| --- | --- | --- |
| QM7/QM7b | Small organic molecules with properties | quantum-machine.org |
| QM8 | Electronic spectra and excited state energies | quantum-machine.org |
| QM9 | 134K molecules with 13 properties from DFT | quantum-machine.org |
| MD Trajectories | Molecular dynamics simulation data | quantum-machine.org |
| SolProp | 1 million COSMO-RS calculations and solvation data | mit |
| SOMAS | Solubility for redox-flow battery design | figshare |

Bioactivity & Drug Discovery

| Dataset | Description | Link |
| --- | --- | --- |
| BindingDB | 2.6M binding data entries for molecular recognition | bindingdb |
| ChEBI-20 | 33,010 molecule-description pairs for captioning | paperswithcode |
| Papyrus | Curated bioactivity combining ChEMBL and ExCAPE-DB | 4tu |
| LIT-PCBA | 15 target sets; 7,761 actives; 382,674 inactives | drugdesign |
| MPCD | 39 datasets for activity prediction | github |
| MoleculeACE | Benchmark for activity cliff compounds | github |
| ACNet | 400K Matched Molecular Pairs against 190 targets | acnet |

Reaction Datasets

| Dataset | Description | Link |
| --- | --- | --- |
| USPTO | Reactions from US patents 1976-2016 | figshare |
| Open Reaction Database | Standardized open reaction data | ord |
| RDB7 | Atom-mapped SMILES with barrier heights and enthalpies | zenodo |
| Dreher-Doyle | 3,955 Pd-catalyzed Buchwald-Hartwig reactions | github |
| Perera (Suzuki) | 5,760 Pd-catalyzed Suzuki-Miyaura reactions | github |

ADME & Pharmacology

| Dataset | Description | Link |
| --- | --- | --- |
| EPA CompTox | Chemistry, toxicity, exposure for 100K+ chemicals | epa |
| SIDER | Drug side effects (CC BY-NC-SA 4.0) | embl |
| Caco-2 Permeability | Drug absorption through intestinal tissue | figshare |
| PAMPA Permeability | Drug permeability assay data | doi |
| KEGG PATHWAY | Biological functions from molecular-level data | kegg |
| LOTUS | 750,000+ structure-organism pairs | zenodo |
| MetXBioDB | Biotransformation reactions and metabolites | zenodo |
| ONSIDES | Adverse drug effects from FDA labels | github |
| HMDB | Small molecule metabolites in human body | hmdb |
| Guide to PHARMACOLOGY | Expert-curated ligand-activity-target (CC BY-SA 4.0) | iuphar |
| Clinical Trials | All study records from ClinicalTrials.gov | ct.gov |
| KD-DTI | Drug-target interaction triplets | github |
| Metrabase | Human small molecule metabolism and transport | cam |
| Open Targets | Genetics/genomics for target identification | opentargets |
| Probes & Drugs | Bioactive compound libraries | probes-drugs |
| DDI Dataset | MEDLINE/DrugBank documents on drug interactions | paperswithcode |

Text & NLP Datasets

| Dataset | Description | Link |
| --- | --- | --- |
| PubMed | Abstracts and citations from biomedical literature | pubmed |
| PubMed Central | Free full-text biomedical article archive | pmc |
| S2ORC | 81.1M English academic papers (CC BY-NC 4.0) | github |
| ChemTables | 788 chemical patent tables with labels (CC BY-NC 3.0) | mendeley |
| NLMChem | 150 manually annotated full-text articles on chemicals | nlm |
| PubChemSTM | 281K chemical structure and text paired data | arxiv |
| Elsevier Corpus | 40,001 open-access CC-BY articles | elsevier |
| Europe PMC | Bulk download of 5M+ full-text articles | europepmc |
| BioRxiv XML | Full-text bioRxiv via Amazon S3 | biorxiv |
| MedRxiv XML | Medical research articles via S3 | medrxiv |
| BC5CDR | 1,500 PubMed articles with annotated chemicals/diseases | paperswithcode |
| PubMedQA | QA dataset with 1K expert labels, 273.5K instances | pubmedqa |

LLM Training Datasets

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| ChemPile | Mixture-of-expert chemical corpus | 75B+ tokens | huggingface |
| SmolInstruct | Instruction dataset from 15 chemistry tasks | 3.3M pairs | huggingface |
| CAMEL Chemistry | GPT-4 conversation pairs about chemistry | 20K samples | huggingface |
| ChemNLP | Curated text for chemistry LLMs | 80K records | huggingface |
| ChemLLMBench | Chemistry evaluation benchmark for LLMs | 8 tasks | github |
| SciCode | Scientific coding reasoning benchmark | 338 subproblems | github |
| ChemData 700K | Chemistry question-answer pairs | 700K pairs | huggingface |
| MatSci-Instruct | Materials science instruction data | 52K samples | huggingface |
| MoleculeQA | Molecular understanding benchmark | 62K QA pairs | github |
| ChemBench | 7-domain benchmarking across chemistry | 2.7K questions | github |
| MatText | Benchmarking language representations of materials | - | github |
| MegaScience | Multi-domain science question answering | - | huggingface |
| ZINC20-ML | Deep-learning-ready ZINC20 formats | 300M+ | zinc20-ML |

Literature-Mined Datasets

| Dataset | Description | Size | Link |
| --- | --- | --- | --- |
| PubChem | Chemical information from 750+ data sources | 116M compounds | pubchem |
| Open Reaction Database | Standardized chemical reaction data | 1M+ reactions | ord |
| PatCID | Chemical structures from patent images | 81M images | github |
| MatScholar | NLP-extracted materials entities | 5M+ abstracts | matscholar |
| L2M3 | Literature-mined materials-morphology database | 51K entries | matscholar |
| ChemDataExtractor | Toolkit for auto-extracting chemical data | - | chemdataextractor |

Reference Resources

| Resource | Description | Link |
| --- | --- | --- |
| IUPAC Gold Book | Chemistry terminology and definitions | iupac |
| LibreTexts Chemistry | Open-access chemistry textbook | libretexts |
| OpenStax Chemistry 2e | Free textbook (CC BY 4.0) | openstax |
| ThermoML Archive | Thermophysical property data | nist |

Dataset Collections

| Resource | Description |
| --- | --- |
| awesome-matchem-datasets | Materials & chemistry datasets (Blaiszik) |
| awesome-chemistry-datasets | Curated list of ML-ready chemistry datasets |
| TDC | Therapeutics Data Commons: drug discovery datasets |

Data Formats

Common formats you’ll encounter:

| Format | Description | Tool |
| --- | --- | --- |
| SMILES | String representation of molecules | RDKit |
| SDF/MOL | Structure files with 2D or 3D coordinates | RDKit, Open Babel |
| InChI | Unique molecular identifier | RDKit |
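
The formats above describe the same molecule in different ways, and RDKit can convert between them. The following is a minimal sketch, assuming an RDKit build with InChI support; ethanol and the variable names are chosen purely for illustration.

```python
from rdkit import Chem

# Two different but equivalent SMILES strings for ethanol.
mol_a = Chem.MolFromSmiles("CCO")
mol_b = Chem.MolFromSmiles("OCC")

# Canonical SMILES maps both inputs to the same string.
canonical_a = Chem.MolToSmiles(mol_a)
canonical_b = Chem.MolToSmiles(mol_b)

# InChI is a standardized identifier, so it also matches for both inputs.
inchi_a = Chem.MolToInchi(mol_a)
inchi_b = Chem.MolToInchi(mol_b)

# A MolBlock is the text of a MOL file (one record of an SDF file).
mol_block = Chem.MolToMolBlock(mol_a)

print(canonical_a == canonical_b, inchi_a == inchi_b)
```

Comparing canonical SMILES or InChI strings is a common way to detect duplicate molecules when merging datasets that were written with different SMILES conventions.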

Loading Data with RDKit

    from rdkit import Chem
    import pandas as pd

    # From SMILES
    mol = Chem.MolFromSmiles("CCO")

    # From SDF file
    suppl = Chem.SDMolSupplier("molecules.sdf")
    mols = [mol for mol in suppl if mol is not None]

    # From CSV with SMILES column
    df = pd.read_csv("molecules.csv")
    df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)
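
`Chem.MolFromSmiles` returns `None` for a string it cannot parse, so a CSV with a few bad rows ends up with `None` entries in the `mol` column. Below is a hedged sketch of one way to drop such rows. The in-memory CSV text and the `name` column are made up for illustration; the `smiles` column name follows the CSV example above.

```python
import io

import pandas as pd
from rdkit import Chem, RDLogger

# Optional: silence RDKit's parse-error messages for this demo.
RDLogger.DisableLog("rdApp.*")

# Hypothetical CSV content standing in for pd.read_csv("molecules.csv").
# The last row has an unclosed ring and cannot be parsed.
csv_text = "name,smiles\nethanol,CCO\nbenzene,c1ccccc1\nbroken,C1CC\n"
df = pd.read_csv(io.StringIO(csv_text))

# Parse each SMILES string; unparsable strings become None.
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)

# Keep only rows that parsed, and count how many were dropped.
valid = df[df["mol"].notna()].reset_index(drop=True)
n_dropped = len(df) - len(valid)
print(f"kept {len(valid)} molecules, dropped {n_dropped}")
```

Reporting the number of dropped rows makes silent data loss visible, which matters when comparing results against published benchmark sizes.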