Configuration Reference
All scripts are driven by TOML configuration files located in bench_configs/.
Benchmarking Config
File: bench_configs/run_benchmarking.toml
Used by: scripts/run_benchmarking.py --config <path>
Required Keys
| Key | Type | Valid Values | Description |
|---|---|---|---|
| data_path | str | Existing file or directory | Path to a single HDF5 file or a directory of HDF5 files |
| output_dir | str | Existing directory | Directory where results are saved |
| methods | list | ["UMAP", "t-SNE", "GTM", "PCA"] | Dimensionality reduction (DR) methods to benchmark (any subset) |
| n_components | int | Positive integer | Number of PCA components (typically 2) |
| k_neighbors | list[int] | List of positive integers | Values of k for neighborhood metrics, e.g. [2, 5, 10, 20, 50] |
| optimization_type | str | "insample", "outsample" | Whether to evaluate on the training data or on held-out validation data |
| scaling | str | "standardize", "minmax", "none" | Feature scaling method |
| similarity_metric | str | "euclidean", "tanimoto" | Distance metric used in the ambient (original feature) space |
| sample_size | int | Positive integer | Number of samples used to calculate metrics on large datasets |
| test | bool | true, false | If true, uses a single parameter combination per method (for testing) |
| plot_data | bool | true, false | Whether to generate visualization plots |
Optional Keys
| Key | Type | Description |
|---|---|---|
| val_data_path | str | Path to validation HDF5 file (required when optimization_type = "outsample") |
| k_hit | int | Specific k value used as optimization target (must appear in k_neighbors) |
| log_level | str | Logging verbosity: "DEBUG", "INFO", "WARNING", "ERROR" |
Example
data_path = "datasets/CHEMBL204.h5"
output_dir = "results/insample_eucl"
methods = ["UMAP", "t-SNE", "GTM", "PCA"]
n_components = 2
k_neighbors = [2, 5, 10, 20, 50]
k_hit = 20
optimization_type = "insample"
scaling = "standardize"
similarity_metric = "euclidean"
sample_size = 2500
test = false
plot_data = true
log_level = "INFO"
Feature Generation Config
File: bench_configs/features.toml
Used by: scripts/generate_descriptors.py <path>
Sections
[input]
| Key | Type | Description |
|---|---|---|
| input_path | str | Directory containing input SMILES files |
| file_pattern | str | Glob pattern for input files (e.g. "*.smi") |
| output_path | str | Directory where output HDF5 files are saved |
[chemdist]
| Key | Type | Description |
|---|---|---|
| model_path | str | Path to trained PyTorch model (.pt file) |
| device | str | Device to run the model on: "cuda" or "cpu" |
[chemdist.params]
| Key | Type | Description |
|---|---|---|
| edge_in_feats | int | Number of edge features for GNN |
| embed_size | int | Embedding vector dimensionality |
| node_in_feats | int | Number of node features for GNN |
[morgan]
| Key | Type | Description |
|---|---|---|
| morgan_radius | int | Radius for Morgan fingerprints (default: 2) |
| morgan_fp_size | int | Fingerprint bit length (default: 1024) |
[maccs]
No configuration needed. MACCS keys are 167-bit structural keys generated by RDKit.
[preprocess]
Controls whether constant features are removed after generation.
[logging]
| Key | Type | Description |
|---|---|---|
| log_path | str | Directory for log files |
| log_level | str | Logging level |
[parallel]
| Key | Type | Description |
|---|---|---|
| num_workers | int | Number of parallel workers (1 = no parallelization) |
Example
[input]
input_path = "datasets/"
file_pattern = "*.smi"
output_path = "features/"
[chemdist]
model_path = "models/model_trained.pt"
device = "cuda"
[chemdist.params]
edge_in_feats = 12
embed_size = 16
node_in_feats = 74
[morgan]
morgan_radius = 2
morgan_fp_size = 1024
[maccs]
[preprocess]
[logging]
log_path = "logs/"
log_level = "INFO"
[parallel]
num_workers = 4
Method Parameter Grids
The parameter grid files are located in bench_configs/method_configs/. They define the hyperparameter search space for each DR method.
UMAP (umap_config.toml)
| Parameter | Default Grid | Description |
|---|---|---|
| n_neighbors | [2, 4, 6, 8, 16, 32, 64, 128, 256] | Size of local neighborhood |
| min_dist | [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.99] | Minimum distance between points in embedding |
| n_components | [2] | Output dimensionality |
Total combinations: 72 (9 × 8 × 1)
t-SNE (tsne_config.toml)
| Parameter | Default Grid | Description |
|---|---|---|
| perplexity | [1, 2, 4, 8, 16, 32, 64, 128] | Effective number of local neighbors |
| exaggeration | [1, 2, 3, 4, 5, 6, 8, 16, 32] | Early exaggeration factor |
| n_components | [2] | Output dimensionality |
Total combinations: 72 (8 × 9 × 1)
GTM (gtm_config.toml)
| Parameter | Default Grid | Description |
|---|---|---|
| num_nodes | [225, 625, 1600] | Number of grid nodes (square root gives grid dimension) |
| num_basis_functions | [100, 400, 1225] | Number of RBF basis functions |
| reg_coeff | [1, 10, 100] | Regularization coefficient |
| basis_width | [0.1, 0.4, 0.8, 1.2] | Width of RBF basis functions |
Total combinations: 108 (3 × 3 × 3 × 4)
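The combination counts above are the sizes of the Cartesian product of each grid. The sketch below reproduces them with itertools.product; the grids are copied into Python dictionaries for illustration, so the actual key layout of the TOML files is an assumption, and expand_grid is a hypothetical helper rather than the project's own search code.

```python
from itertools import product
from math import isqrt

# Default grids, copied from the tables above.
GRIDS = {
    "UMAP": {
        "n_neighbors": [2, 4, 6, 8, 16, 32, 64, 128, 256],
        "min_dist": [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.99],
        "n_components": [2],
    },
    "t-SNE": {
        "perplexity": [1, 2, 4, 8, 16, 32, 64, 128],
        "exaggeration": [1, 2, 3, 4, 5, 6, 8, 16, 32],
        "n_components": [2],
    },
    "GTM": {
        "num_nodes": [225, 625, 1600],
        "num_basis_functions": [100, 400, 1225],
        "reg_coeff": [1, 10, 100],
        "basis_width": [0.1, 0.4, 0.8, 1.2],
    },
}


def expand_grid(grid: dict) -> list[dict]:
    """Hypothetical helper: one dict per parameter combination in the grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


for method, grid in GRIDS.items():
    print(method, len(expand_grid(grid)))
# UMAP 72
# t-SNE 72
# GTM 108

# Each num_nodes value is a perfect square; its root is the side of the GTM grid.
grid_sides = [isqrt(n) for n in GRIDS["GTM"]["num_nodes"]]
print(grid_sides)  # [15, 25, 40]
```

Adding a value to any list multiplies the run count for that method, which is worth keeping in mind before widening a grid.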