Chemistry Datasets

Benchmark datasets for molecular property prediction, reactions, and more.

Molecular Property Prediction

| Dataset | Task | Size | Link |
| --- | --- | --- | --- |
| MoleculeNet | Multiple benchmark tasks | 700K+ molecules | Paper |
| TDC | Drug discovery benchmarks | Various | tdcommons.ai |
| ZINC | Virtual screening library | 250M+ | zinc.docking.org |

Reactions

| Dataset | Task | Size | Link |
| --- | --- | --- | --- |
| USPTO | Reaction prediction | 1M+ reactions | github |
| Open Reaction Database | Open reactions | Growing | open-reaction-database.org |
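
Reaction datasets of this kind are commonly distributed as reaction SMILES, where the roles are separated by `>` (`reactants>agents>products`) and the molecules within a role by `.`. The sketch below splits such a string using only the standard library. It assumes that layout; the `Reaction` class and `parse_reaction_smiles` helper are hypothetical names introduced for illustration, and the example reaction is hand-written rather than taken from any dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reaction:
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    fields = rxn.split()
    if not fields:
        raise ValueError("empty reaction string")
    # Some files append extra fields after whitespace; keep only the SMILES part.
    parts = fields[0].split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators in {rxn!r}")
    # Molecules within a role are separated by '.'; an empty role becomes [].
    reactants, agents, products = (p.split(".") if p else [] for p in parts)
    return Reaction(reactants, agents, products)


# Acid-catalysed esterification of acetic acid with ethanol (illustrative only).
rxn = parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(rxn.reactants)
print(rxn.agents)
print(rxn.products)
```

Splitting on `.` is deliberately naive: salts and other multi-component species are also written with `.`, so the fragments of one compound end up as separate entries. For anything beyond a quick inspection, a cheminformatics toolkit's reaction parser is the safer choice.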

Dataset Collections

| Resource | Description |
| --- | --- |
| awesome-chemistry-datasets | Curated list of ML-ready chemistry datasets |
| TDC | Therapeutics Data Commons: drug discovery datasets |

Data Formats

Common formats you’ll encounter:

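| Format | Description | Tool |
| --- | --- | --- |
| SMILES | Line-notation string representation of molecules | RDKit |
| SDF/MOL | Structure files with 2D or 3D atomic coordinates | RDKit, Open Babel |
| InChI | Standardized, unique molecular identifier | RDKit |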
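
As a rough illustration of how these formats relate, the sketch below converts one molecule between them with RDKit. It assumes an RDKit build with InChI support, which is the case for the standard pip and conda packages; the variable names are illustrative.

```python
from rdkit import Chem

# Ethanol, deliberately written in a non-canonical atom order.
mol = Chem.MolFromSmiles("OCC")

canonical = Chem.MolToSmiles(mol)    # canonical SMILES string
inchi = Chem.MolToInchi(mol)         # InChI identifier
inchikey = Chem.MolToInchiKey(mol)   # fixed-length hashed form of the InChI
molblock = Chem.MolToMolBlock(mol)   # MOL block text, as stored inside SDF files

# Round trip: an InChI can be parsed back into a molecule.
mol_from_inchi = Chem.MolFromInchi(inchi)

print(canonical)
print(inchi)
print(inchikey)
print(Chem.MolToSmiles(mol_from_inchi))
```

A molecule parsed from SMILES has no coordinates, so the MOL block written here contains only placeholder zeros. Real 2D or 3D coordinates have to be generated separately or read from an existing structure file.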

Loading Data with RDKit

```python
from rdkit import Chem
import pandas as pd

# From SMILES
mol = Chem.MolFromSmiles("CCO")

# From SDF file
suppl = Chem.SDMolSupplier("molecules.sdf")
mols = [mol for mol in suppl if mol is not None]

# From CSV with SMILES column
df = pd.read_csv("molecules.csv")
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)
```
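
`Chem.MolFromSmiles` returns `None` for a string it cannot parse, and an empty CSV cell is read by pandas as `NaN` rather than a string, so real-world files usually need a filtering step. The following is a minimal sketch of one way to do that. It uses an in-memory CSV so it runs without any files; the `parse_smiles` helper and the `smiles`/`label` column names are assumptions made for this example.

```python
import io

import pandas as pd
from rdkit import Chem, RDLogger

# Optional: silence RDKit's parse-error messages for the bad rows below.
RDLogger.DisableLog("rdApp.*")

# Stand-in for molecules.csv: two valid rows, one invalid SMILES, one empty cell.
csv_text = """smiles,label
CCO,1
c1ccccc1,0
not_a_smiles,1
,0
"""


def parse_smiles(value):
    """Return an RDKit Mol, or None for missing or unparseable input."""
    if not isinstance(value, str) or not value.strip():
        return None  # NaN or blank cell
    return Chem.MolFromSmiles(value)  # None when parsing fails


df = pd.read_csv(io.StringIO(csv_text))
df["mol"] = df["smiles"].apply(parse_smiles)

# Keep only the rows that produced a molecule.
clean = df[df["mol"].notna()].reset_index(drop=True)
print(f"kept {len(clean)} of {len(df)} rows")
```

Dropping failed rows silently can hide data-quality problems, so it is often worth logging or saving the rejected SMILES instead of just discarding them.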