Coverage

What OpenAlgo supports today

We are honest about what works and what doesn't. This page shows the maturity of each template family we use to generate code from QSAR and molecular property prediction papers. If a domain isn't listed here, we don't support it yet — and we'd rather say so upfront than produce unreliable output.

Template families

Fingerprint / descriptor classification

Stable

Binary endpoint prediction using molecular fingerprints (Morgan, MACCS) and classical descriptors with Random Forest, SVM, and XGBoost classifiers. Covers the most common QSAR classification workflow in published literature.
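For orientation, the workflow this family targets looks roughly like the sketch below. It assumes RDKit, NumPy, and scikit-learn are installed. The `featurize` helper, the toy SMILES, and the labels are illustrative assumptions, not OpenAlgo's generated output.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.ensemble import RandomForestClassifier

# Morgan fingerprint generator (radius 2, 2048 bits); both values are illustrative.
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def featurize(smiles):
    """Hypothetical helper: SMILES string -> Morgan bit vector as a NumPy array."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return morgan.GetFingerprintAsNumPy(mol)


# Toy data set: the label is 1 if the molecule contains an aromatic ring.
smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "c1ccccc1O", "c1ccccc1N"]
labels = [0, 0, 0, 1, 1, 1]

X = np.stack([featurize(s) for s in smiles])
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, labels)

# Predicted probability of the positive class for an unseen molecule (toluene).
proba = model.predict_proba(np.stack([featurize("Cc1ccccc1")]))[:, 1]
```

A real paper would add a held-out split and proper metrics. Six molecules are only enough to show the shape of the pipeline.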

Libraries: scikit-learn, RDKit, XGBoost

Known limitations

- Multi-class targets require manual configuration after generation.
- Custom descriptor calculators beyond RDKit built-ins are not yet supported.

Example papers

“Predicting hERG channel blockers using Morgan fingerprints and random forest ensembles”
“MACCS-key SVM models for Ames mutagenicity prediction”

Fingerprint / descriptor regression

Stable

Continuous target prediction (solubility, logP, pIC50) using the same featurization pipeline as classification but with regression heads. Supports standard error metrics and applicability domain estimation.
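This page does not say which applicability domain method the template uses. One common approach is a nearest-neighbour similarity check: a query molecule counts as in-domain if its highest Tanimoto similarity to any training fingerprint meets a threshold. Below is a minimal sketch of that idea, with fingerprints represented as sets of on-bit indices. The function names and the 0.3 threshold are assumptions chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain(query, training_fps, threshold=0.3):
    """Return (flag, best_similarity) for one query fingerprint."""
    best = max((tanimoto(query, fp) for fp in training_fps), default=0.0)
    return best >= threshold, best


# Toy fingerprints: each set holds the indices of bits that are switched on.
training_fps = [{1, 2, 3}, {10, 11, 12, 13}]
near_flag, near_sim = in_domain({2, 3, 4}, training_fps)   # shares bits 2 and 3
far_flag, far_sim = in_domain({40, 41}, training_fps)      # shares no bits
```

Predictions for molecules flagged as out of domain would typically be reported with a warning rather than dropped.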

Libraries: scikit-learn, RDKit, XGBoost

Known limitations

- Multi-task regression (simultaneous prediction of multiple endpoints) is not yet supported.
- Uncertainty quantification is limited to ensemble variance; conformal prediction is on the roadmap.
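Ensemble variance here means the spread of predictions across the members of an ensemble, computed separately for each molecule. The sketch below shows the arithmetic using only the standard library. The function name and the toy predictions are assumptions for illustration.

```python
import statistics


def ensemble_mean_and_variance(member_predictions):
    """Per-molecule mean and population variance across ensemble members.

    member_predictions[m][i] is member m's prediction for molecule i.
    """
    per_molecule = list(zip(*member_predictions))
    means = [statistics.fmean(p) for p in per_molecule]
    variances = [statistics.pvariance(p) for p in per_molecule]
    return means, variances


# Three hypothetical ensemble members predicting two molecules.
member_predictions = [
    [1.0, 2.0],
    [2.0, 2.0],
    [3.0, 2.0],
]
means, variances = ensemble_mean_and_variance(member_predictions)
# The members disagree on the first molecule and agree exactly on the second,
# so only the first has non-zero variance.
```

With a scikit-learn random forest, the members would typically be the fitted trees exposed through the `estimators_` attribute.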

Example papers

“Aqueous solubility prediction with extended-connectivity fingerprints and gradient-boosted trees”
“Random forest regression models for lipophilicity using 2D molecular descriptors”

Graph neural network classification

Beta

Message-passing neural networks (MPNN) and graph convolutional networks (GCN) for binary classification directly on molecular graphs. Handles atom and bond features from standard featurization.

Libraries: PyTorch Geometric, RDKit, DeepChem

Known limitations

- Does not yet support edge features in message passing.
- Attention-based pooling variants (GAT, GATv2) are experimental.
- Training hyperparameters are set to sensible defaults; full hyperparameter search scaffolding is planned.

Example papers

“Graph convolutional networks for toxicity prediction on ToxCast endpoints”
“MPNN-based virtual screening for kinase inhibitors”

Graph neural network regression

Beta

Same graph architecture as GNN classification but configured for continuous targets. Supports standard regression losses and evaluation metrics for molecular property prediction.

Libraries: PyTorch Geometric, RDKit, DeepChem

Known limitations

- Edge features and attention-based pooling share the same limitations as GNN classification.
- Transfer learning from pre-trained graph models is not yet integrated.
- Large-scale datasets (>500k molecules) may require manual batch-size tuning.

Example papers

“Predicting aqueous solubility with message-passing neural networks”
“GCN regression models for binding affinity on PDBbind”

Evaluation / split wrapper

Stable

Reusable scaffolding for dataset splitting and evaluation. Includes scaffold split, temporal split, and stratified k-fold with proper leakage prevention. Designed to wrap any of the above template families.
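To illustrate what a scaffold split guarantees, the sketch below groups molecules by a precomputed scaffold key and assigns whole groups to either train or test, so no scaffold appears on both sides. It is a simplified stand-in, not OpenAlgo's wrapper. The `scaffold_split` name and its parameters are assumptions, and the scaffold keys are taken as given (for example, Murcko scaffold SMILES computed beforehand with RDKit).

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold is shared by train and test.

    scaffolds holds one scaffold key per molecule, in data set order.
    """
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    # Largest scaffold groups are placed first; rare scaffolds tend to land in test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_cutoff = len(scaffolds) - round(test_fraction * len(scaffolds))

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy scaffold keys: five benzene, three pyridine, two cyclohexane molecules.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.2)
```

Because whole groups move together, the realised test fraction can differ from the requested one when scaffold groups are large.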

Libraries: scikit-learn, RDKit, DeepChem

Known limitations

- Custom split functions must follow a specific callable signature; documentation for this is being expanded.
- Temporal splits require an explicit date column in the source data.
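The date-column requirement for temporal splits can be sketched as follows: records measured before a cutoff go to train, the rest to test, and a record without a date is rejected rather than guessed. The `temporal_split` function, the `date` column name, and the toy records are assumptions for illustration.

```python
from datetime import date


def temporal_split(records, cutoff, date_key="date"):
    """Records dated before the cutoff go to train; the rest go to test."""
    cutoff_date = date.fromisoformat(cutoff)
    train, test = [], []
    for record in records:
        if date_key not in record:
            raise KeyError(f"record is missing the {date_key!r} column: {record}")
        measured = date.fromisoformat(record[date_key])
        (train if measured < cutoff_date else test).append(record)
    return train, test


# Toy records; the dates are invented for illustration.
records = [
    {"smiles": "CCO", "date": "2018-05-01"},
    {"smiles": "CCN", "date": "2019-11-30"},
    {"smiles": "c1ccccc1", "date": "2021-02-14"},
]
train, test = temporal_split(records, cutoff="2020-01-01")
```

Splitting on a fixed cutoff keeps every test measurement later than every training measurement, which is the point of a temporal split.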

Example papers

“Impact of scaffold splitting on predictive performance in molecular property models”
“Avoiding data leakage in QSAR: a benchmark of splitting strategies”

What's not supported (yet)

We intentionally exclude the following domains. In each case we either cannot produce reliable output or the workflow requires tooling that sits outside our current architecture.

Molecular docking & scoring

Docking workflows depend on protein structures and specific docking engines (AutoDock, Glide) that sit outside our current extraction pipeline.

Molecular dynamics simulations

MD requires force-field parameterization and GPU-specific configuration that we cannot reliably generate from paper descriptions alone.

Genomics & sequence models

DNA/RNA/protein sequence prediction uses fundamentally different data loaders and architectures. We plan to revisit once the molecular domain is mature.

Wet-lab protocols

Experimental protocols involve physical processes that cannot be translated to code. We focus exclusively on computational workflows.

Novel architectures not in standard libraries

If a paper introduces a custom layer, loss function, or training loop that isn't available in PyTorch Geometric, DeepChem, or scikit-learn, we cannot yet generate it automatically.
