Chemistry Datasets
Benchmark datasets for molecular property prediction and reaction modeling, along with curated dataset collections and the data formats they commonly use.
Molecular Property Prediction
| Dataset | Task | Size | Link |
|---|---|---|---|
| MoleculeNet | Multiple benchmark tasks | 700K+ molecules | Paper |
| TDC | Drug discovery benchmarks | Various | tdcommons.ai |
| ZINC | Virtual screening library | 250M+ | zinc.docking.org |
Reactions
| Dataset | Task | Size | Link |
|---|---|---|---|
| USPTO | Reaction prediction | 1M+ reactions | GitHub |
| Open Reaction Database | Open reactions | Growing | open-reaction-database.org |
Dataset Collections
| Resource | Description |
|---|---|
| awesome-chemistry-datasets | Curated list of ML-ready chemistry datasets |
| TDC | Therapeutics Data Commons — drug discovery datasets |
Data Formats
Common formats you’ll encounter:
| Format | Description | Tool |
|---|---|---|
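| SMILES | Line notation that encodes a molecule as a text string | RDKit |
| SDF/MOL | Structure files with 2D or 3D atom coordinates; an SDF can hold many molecules plus property fields | RDKit, OpenBabel |
| InChI | IUPAC standard textual identifier for a chemical structure | RDKit |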
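As a rough illustration of how these formats relate, the sketch below converts one molecule between them with RDKit. It assumes an RDKit build with InChI support (the default for the common pip and conda packages); ethanol is used only as an example input.

```python
from rdkit import Chem

# Parse ethanol from a SMILES string
mol = Chem.MolFromSmiles("CCO")

# Write the same molecule out in each format from the table
canonical_smiles = Chem.MolToSmiles(mol)  # canonical SMILES string
inchi = Chem.MolToInchi(mol)              # standard InChI string
inchi_key = Chem.MolToInchiKey(mol)       # fixed-length hashed key derived from the InChI
mol_block = Chem.MolToMolBlock(mol)       # MOL block text (the per-molecule unit of an SDF)

# Read the MOL block back to check the round trip
mol_from_block = Chem.MolFromMolBlock(mol_block)

print(canonical_smiles)
print(inchi)
print(inchi_key)
print(mol_from_block.GetNumAtoms())
```

Because no conformer was generated, the MOL block here carries placeholder coordinates; real 3D coordinates require an embedding step first.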
Loading Data with RDKit
from rdkit import Chem
import pandas as pd
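# From a SMILES string (returns None if the string cannot be parsed)
mol = Chem.MolFromSmiles("CCO")
# From an SDF file; entries that fail to parse come back as None and are skipped
suppl = Chem.SDMolSupplier("molecules.sdf")
mols = [m for m in suppl if m is not None]
# From a CSV file with a "smiles" column; unparsable rows get None in "mol"
df = pd.read_csv("molecules.csv")
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)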
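When SMILES are loaded from a CSV, unparsable strings yield `None` and the same molecule can appear under several spellings. The sketch below shows one way to clean that up; the in-memory DataFrame and the `canonical_smiles` column name are illustrative assumptions standing in for a real file.

```python
import pandas as pd
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages for this demo
RDLogger.DisableLog("rdApp.*")

# Hypothetical stand-in for pd.read_csv("molecules.csv"):
# "C1CC" has an unclosed ring and cannot be parsed; "OCC" is ethanol written differently
df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "C1CC", "OCC"]})

# Parse every row; failures become None
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)

# Keep only the rows that parsed
valid = df[df["mol"].notna()].copy()

# Canonical SMILES give each molecule a single spelling, so duplicates can be dropped
valid["canonical_smiles"] = valid["mol"].apply(Chem.MolToSmiles)
unique = valid.drop_duplicates(subset="canonical_smiles")

print(len(df), len(valid), len(unique))
```

Deduplicating on canonical SMILES is a simple choice; depending on the dataset, an InChIKey or a standardization step (salt stripping, tautomer handling) may be more appropriate.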