Configuration Reference

All scripts are driven by TOML configuration files located in bench_configs/.

Benchmarking Config

File: bench_configs/run_benchmarking.toml

Used by: scripts/run_benchmarking.py --config <path>

Required Keys

| Key | Type | Valid Values | Description |
|---|---|---|---|
| data_path | str | Existing file or directory | Path to a single HDF5 file or a directory of HDF5 files |
| output_dir | str | Existing directory | Directory where results are saved |
| methods | list | ["UMAP", "t-SNE", "GTM", "PCA"] | DR methods to benchmark (any subset) |
| n_components | int | Positive integer | Number of PCA components (typically 2) |
| k_neighbors | list[int] | List of positive integers | Values of k for neighborhood metrics, e.g. [2, 5, 10, 20, 50] |
| optimization_type | str | "insample", "outsample" | Whether to evaluate on training data or held-out validation |
| scaling | str | "standardize", "minmax", "none" | Feature scaling method |
| similarity_metric | str | "euclidean", "tanimoto" | Distance metric for ambient space |
| sample_size | int | Positive integer | Number of samples for metrics calculation on large datasets |
| test | bool | true, false | If true, uses a single parameter combination per method (for testing) |
| plot_data | bool | true, false | Whether to generate visualization plots |

Optional Keys

| Key | Type | Description |
|---|---|---|
| val_data_path | str | Path to validation HDF5 file (required when optimization_type = "outsample") |
| k_hit | int | Specific k value used as optimization target (must appear in k_neighbors) |
| log_level | str | Logging verbosity: "DEBUG", "INFO", "WARNING", "ERROR" |

Example

data_path = "datasets/CHEMBL204.h5"
output_dir = "results/insample_eucl"
methods = ["UMAP", "t-SNE", "GTM", "PCA"]
n_components = 2
k_neighbors = [2, 5, 10, 20, 50]
k_hit = 20
optimization_type = "insample"
scaling = "standardize"
similarity_metric = "euclidean"
sample_size = 2500
test = false
plot_data = true
log_level = "INFO"

Feature Generation Config

File: bench_configs/features.toml

Used by: scripts/generate_descriptors.py <path>

Sections

[input]

| Key | Type | Description |
|---|---|---|
| input_path | str | Directory containing input SMILES files |
| file_pattern | str | Glob pattern for input files (e.g. "*.smi") |
| output_path | str | Directory where output HDF5 files are saved |

[chemdist]

| Key | Type | Description |
|---|---|---|
| model_path | str | Path to trained PyTorch model (.pt file) |
| device | str | "cuda" or "cpu" |

[chemdist.params]

| Key | Type | Description |
|---|---|---|
| edge_in_feats | int | Number of edge features for GNN |
| embed_size | int | Embedding vector dimensionality |
| node_in_feats | int | Number of node features for GNN |

[morgan]

| Key | Type | Description |
|---|---|---|
| morgan_radius | int | Radius for Morgan fingerprints (default: 2) |
| morgan_fp_size | int | Fingerprint bit length (default: 1024) |

[maccs]

No configuration needed. MACCS keys are 167-bit structural keys generated by RDKit.

[preprocess]

Controls whether constant features are removed after generation.

[logging]

| Key | Type | Description |
|---|---|---|
| log_path | str | Directory for log files |
| log_level | str | Logging level |

[parallel]

| Key | Type | Description |
|---|---|---|
| num_workers | int | Number of parallel workers (1 = no parallelization) |

Example

[input]
input_path = "datasets/"
file_pattern = "*.smi"
output_path = "features/"

[chemdist]
model_path = "models/model_trained.pt"
device = "cuda"

[chemdist.params]
edge_in_feats = 12
embed_size = 16
node_in_feats = 74

[morgan]
morgan_radius = 2
morgan_fp_size = 1024

[maccs]

[preprocess]

[logging]
log_path = "logs/"
log_level = "INFO"

[parallel]
num_workers = 4

Method Parameter Grids

The parameter grids are located in bench_configs/method_configs/. Each file defines the hyperparameter search space for one DR method.

UMAP (umap_config.toml)

| Parameter | Default Grid | Description |
|---|---|---|
| n_neighbors | [2, 4, 6, 8, 16, 32, 64, 128, 256] | Size of local neighborhood |
| min_dist | [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.99] | Minimum distance between points in embedding |
| n_components | [2] | Output dimensionality |

Total combinations: 72
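
The total is the product of the grid sizes (9 × 8 × 1 = 72). As a sketch of how such a grid expands into individual parameter sets, assuming a full Cartesian product is taken (the helper expand_grid is hypothetical and not part of the benchmarking scripts):

```python
from itertools import product

# Default UMAP grid as listed in the table above.
umap_grid = {
    "n_neighbors": [2, 4, 6, 8, 16, 32, 64, 128, 256],
    "min_dist": [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.99],
    "n_components": [2],
}


def expand_grid(grid: dict) -> list[dict]:
    """Hypothetical helper: one dict per combination of grid values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combos = expand_grid(umap_grid)
print(len(combos))  # prints 72
print(combos[0])    # prints {'n_neighbors': 2, 'min_dist': 0.0, 'n_components': 2}
```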

t-SNE (tsne_config.toml)

| Parameter | Default Grid | Description |
|---|---|---|
| perplexity | [1, 2, 4, 8, 16, 32, 64, 128] | Effective number of local neighbors |
| exaggeration | [1, 2, 3, 4, 5, 6, 8, 16, 32] | Early exaggeration factor |
| n_components | [2] | Output dimensionality |

Total combinations: 72

GTM (gtm_config.toml)

| Parameter | Default Grid | Description |
|---|---|---|
| num_nodes | [225, 625, 1600] | Number of grid nodes (square root gives grid dimension) |
| num_basis_functions | [100, 400, 1225] | Number of RBF basis functions |
| reg_coeff | [1, 10, 100] | Regularization coefficient |
| basis_width | [0.1, 0.4, 0.8, 1.2] | Width of RBF basis functions |

Total combinations: 108
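
As a quick check of the numbers above, the following sketch computes the total from the grid sizes (3 × 3 × 3 × 4) and derives the grid dimension for each num_nodes value by taking its integer square root. The variable names are illustrative only.

```python
import math

# Default GTM grid as listed in the table above.
gtm_grid = {
    "num_nodes": [225, 625, 1600],
    "num_basis_functions": [100, 400, 1225],
    "reg_coeff": [1, 10, 100],
    "basis_width": [0.1, 0.4, 0.8, 1.2],
}

# Total combinations is the product of the number of values per parameter.
total = math.prod(len(values) for values in gtm_grid.values())
print(total)  # prints 108

# Each num_nodes value is a perfect square; its root is the grid dimension.
grid_dims = {n: math.isqrt(n) for n in gtm_grid["num_nodes"]}
print(grid_dims)  # prints {225: 15, 625: 25, 1600: 40}
```

So the default grid covers 15 × 15, 25 × 25, and 40 × 40 node layouts.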