Quickstart

This guide walks through the end-to-end pipeline: generating molecular descriptors, running benchmarking, and analyzing results.

1. Prepare Input Data

Input data should be in HDF5 format (.h5) with the following structure:

dataset.h5
├── dataset/
│   ├── smi          # SMILES strings (required)
│   └── dataset      # Dataset identifiers
└── features/
    ├── mfp_r2_1024  # Morgan fingerprints (N x 1024)
    ├── maccs_keys   # MACCS keys (N x 167)
    └── embed        # ChemDist embeddings (N x 16)

Sample datasets are provided in the datasets/ directory (e.g., CHEMBL204.h5).
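Before running the pipeline on your own file, it can help to confirm that it matches the layout above. The following is a minimal sketch of such a check, not part of the pipeline itself: `check_layout` is a hypothetical helper, and it assumes that only `dataset/smi` is mandatory and that feature matrices, when present, have one row per SMILES string. It accepts any object that supports `path in store` and `store[path].shape`, so a plain dictionary stands in for the HDF5 file here.

```python
from types import SimpleNamespace

# Layout taken from the tree above; only dataset/smi is marked as required there.
REQUIRED = ["dataset/smi"]
# Expected number of columns for each feature matrix, if it is present.
FEATURE_WIDTHS = {
    "features/mfp_r2_1024": 1024,
    "features/maccs_keys": 167,
    "features/embed": 16,
}


def check_layout(store):
    """Return a list of problems; an empty list means the layout looks fine.

    Hypothetical helper: the pipeline's own validation may differ.
    """
    problems = []
    for path in REQUIRED:
        if path not in store:
            problems.append(f"missing required dataset: {path}")
    n_rows = store["dataset/smi"].shape[0] if "dataset/smi" in store else None
    for path, width in FEATURE_WIDTHS.items():
        if path not in store:
            continue  # assumed optional: descriptors can be generated in step 2
        shape = store[path].shape
        if len(shape) != 2 or shape[1] != width:
            problems.append(f"{path}: expected (N, {width}), got {shape}")
        elif n_rows is not None and shape[0] != n_rows:
            problems.append(f"{path}: {shape[0]} rows but {n_rows} SMILES")
    return problems


# Stand-in for an open HDF5 file so the sketch runs without extra packages.
fake = {
    "dataset/smi": SimpleNamespace(shape=(3,)),
    "features/mfp_r2_1024": SimpleNamespace(shape=(3, 1024)),
    "features/maccs_keys": SimpleNamespace(shape=(3, 166)),  # deliberately wrong width
}
print(check_layout(fake))
```

If h5py is installed, the same function should work on a real file, for example `check_layout(h5py.File("datasets/CHEMBL204.h5", "r"))`, since h5py files support path lookups and expose a `shape` on each dataset.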

2. Generate Descriptors (Optional)

If your data only contains SMILES strings, generate molecular descriptors first:

python scripts/generate_descriptors.py bench_configs/features.toml

Edit bench_configs/features.toml to configure input/output paths and descriptor types. See the Configuration Reference for all options.

3. Run Benchmarking

Edit bench_configs/run_benchmarking.toml to set your data path and desired methods:

data_path = "datasets/CHEMBL204.h5"
output_dir = "results/my_run"
methods = ["UMAP", "t-SNE", "GTM", "PCA"]
n_components = 2
k_neighbors = [2, 5, 10, 20, 50]
k_hit = 20
optimization_type = "insample"
scaling = "standardize"
similarity_metric = "euclidean"
sample_size = 2500
test = false
plot_data = true

Run the benchmarking:

python scripts/run_benchmarking.py --config bench_configs/run_benchmarking.toml

This performs a grid search over hyperparameters for each method, evaluates quality metrics, and saves the results as HDF5 files in the output directory.

Test mode

Set test = true to run a single parameter combination per method instead of the full grid. This is useful for verifying your setup before launching a full grid search.
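
The difference between a full grid search and test mode can be pictured with a short sketch. The grids below are hypothetical (the actual search spaces are defined by the benchmarking code), and `combinations` is an illustrative helper rather than a function from this project. It expands each grid into every parameter combination and, in test mode, keeps only the first one.

```python
from itertools import product

# Hypothetical grids for illustration; the real search spaces live in the benchmarking code.
PARAM_GRIDS = {
    "UMAP": {"n_neighbors": [5, 15, 50], "min_dist": [0.1, 0.5]},
    "t-SNE": {"perplexity": [10, 30, 50]},
    "PCA": {},  # no tunable parameters in this sketch
}


def combinations(grid, test=False):
    """Expand a parameter grid into a list of dicts, one per combination."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
    # Test mode keeps a single combination so every method still runs once.
    return combos[:1] if test else combos


for method, grid in PARAM_GRIDS.items():
    full = combinations(grid)
    quick = combinations(grid, test=True)
    print(method, len(full), len(quick))
```

The cost of a full run grows with the product of the grid sizes, which is why a single-combination pass is a quick way to confirm that data loading and output writing work.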

4. Analyze Results

Aggregate metrics across datasets and generate summary tables and plots:

python scripts/analyze_results.py \
    --input_dir results/my_run \
    --output_dir results/my_run \
    --k_hit 20

This produces:

  • CSV files with per-dataset metrics
  • PNG comparison plots
  • DOCX summary tables
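
The steps in this guide can also be chained from a single driver script. The sketch below only assembles the command lines shown in this guide; `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the default of printing the commands without executing them is a choice made here for safety. It assumes that `run_dir` matches the `output_dir` set in bench_configs/run_benchmarking.toml.

```python
import shlex
import subprocess
import sys


def pipeline_commands(run_dir="results/my_run", k_hit=20, with_descriptors=False):
    """Build the command lines from steps 2-4 as argument lists."""
    py = sys.executable  # use the same interpreter that runs this script
    cmds = []
    if with_descriptors:
        # Step 2 is optional and takes the config path as a positional argument.
        cmds.append([py, "scripts/generate_descriptors.py", "bench_configs/features.toml"])
    cmds.append([py, "scripts/run_benchmarking.py",
                 "--config", "bench_configs/run_benchmarking.toml"])
    cmds.append([py, "scripts/analyze_results.py",
                 "--input_dir", run_dir,
                 "--output_dir", run_dir,
                 "--k_hit", str(k_hit)])
    return cmds


def run_pipeline(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_pipeline(pipeline_commands())
```

Call `run_pipeline(pipeline_commands(), dry_run=False)` from the repository root to execute the steps in order.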

What's Next