CLI Reference¶
All scripts are located in the scripts/ directory.
run_benchmarking.py¶
Main script for grid search optimization of DR methods on chemical datasets.
| Argument | Required | Description |
|---|---|---|
--config |
Yes | Path to TOML configuration file |
What it does:
- Loads HDF5 datasets (single file or directory)
- For each dataset and feature type: removes duplicates, scales features, computes PCA
- Calculates ambient-space distance matrices and k-NN indices
- Runs grid search optimization for each specified method
- Computes quality metrics (NN overlap, co-ranking measures, trustworthiness, continuity)
- Saves results to HDF5
Output: One HDF5 file per dataset/feature combination in output_dir, containing coordinates and metrics for each method.
See Configuration Reference for all config options.
generate_descriptors.py¶
Generates molecular descriptors (fingerprints and embeddings) from SMILES.
| Argument | Required | Description |
|---|---|---|
config_file |
Yes | Path to TOML configuration file (positional argument) |
What it does:
- Loads SMILES files matching the configured pattern
- Generates Morgan fingerprints, MACCS keys, and/or ChemDist embeddings
- Removes duplicate molecules and constant features
- Saves to HDF5 with hierarchical structure
Output: HDF5 files in output_path with dataset/ and features/ groups.
See Feature Generation Config for all config options.
analyze_results.py¶
Aggregates optimization results across datasets and generates summary statistics and visualizations.
python scripts/analyze_results.py \
--input_dir <input_directory> \
--output_dir <output_directory> \
--k_hit <k_value> \
--separate <bool>
| Argument | Required | Description |
|---|---|---|
--input_dir |
Yes | Directory containing optimization result HDF5 files |
--output_dir |
No | Output directory for statistics (defaults to input_dir) |
--k_hit |
No | k value for PNN metric reporting |
--separate |
No | Whether to save results separately per dataset |
Output:
stats_by_dataset/{dataset}_metrics.csv-- Per-dataset metrics{descriptor}_final_metrics.pkl-- Pickled aggregated metrics{descriptor}.png-- Comparison plots{descriptor}.docx-- Formatted results tables
prepare_lolo.py¶
Prepares leave-one-library-out (LOLO) train/test splits for cross-validation studies.
| Argument | Required | Description |
|---|---|---|
input_file |
Yes | Path to input HDF5 file with combined datasets |
output_dir |
Yes | Directory to save split datasets |
What it does:
For each unique dataset identifier in the combined file:
- Creates a
leave_out_{dataset}/directory - Saves the left-out compounds as
{dataset}.h5 - Saves the remaining compounds (minus exact duplicates) as
{dataset}_out.h5
analyze_lib_distance_preservation.py¶
Analyzes how well DR methods preserve pairwise distances between molecules.
python scripts/analyze_lib_distance_preservation.py \
--input_dir <input_directory> \
--output_dir <output_directory> \
--similarity_metric <metric>
| Argument | Required | Default | Description |
|---|---|---|---|
--input_dir |
Yes | -- | Directory containing optimization results |
--output_dir |
No | Same as input_dir |
Output directory for distance statistics |
--similarity_metric |
No | "euclidean" |
Distance metric: "euclidean" or "tanimoto" |
Output:
distances_by_dataset/{dataset}_distances.csv-- Per-dataset pairwise distances{descriptor}_final_distances.pkl-- Aggregated distance statistics