I/O Utilities

HDF5 data handling, configuration loading, and data preprocessing.

Configuration

cdr_bench.io_utils.io.load_config(config_file)

Load the configuration from a TOML file.

cdr_bench.io_utils.io.validate_config(config)

Validate the TOML configuration for required fields, types, and values.

Parameters:

    config (dict, required): The loaded configuration dictionary.

Raises:

    ValueError: If the configuration is invalid.

HDF5 I/O

cdr_bench.io_utils.io.check_hdf5_file_format(file_path)

cdr_bench.io_utils.io.read_features_hdf5_dataframe(file_path)

cdr_bench.io_utils.io.save_dataframe_to_hdf5(df, file_path, non_feature_columns, feature_columns)

Save a DataFrame to an HDF5 file with a hierarchical structure.

Parameters:

    df (DataFrame, required): The DataFrame to save.
    file_path (str, required): The path to the HDF5 file.
    non_feature_columns (Union[List[str], Dict[str, str]], required): List or dictionary of non-feature columns.
    feature_columns (Union[List[str], Dict[str, str]], required): List or dictionary of feature columns.

Returns:

    None
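The exact hierarchical layout cdr_bench writes is not documented here; the following sketch shows one plausible arrangement with h5py, storing the feature columns as a single 2-D dataset and each non-feature column under a metadata group. The group names ("features", "metadata") are assumptions for illustration:

```python
import h5py
import pandas as pd


def save_dataframe_sketch(df, file_path, non_feature_columns, feature_columns):
    """Sketch: feature columns become one 2-D array under /features;
    non-feature columns are stored one dataset each under /metadata."""
    with h5py.File(file_path, "w") as f:
        f.create_dataset("features", data=df[list(feature_columns)].to_numpy())
        meta = f.create_group("metadata")
        for col in non_feature_columns:
            values = df[col].to_numpy()
            if values.dtype == object:  # strings need a byte dtype for HDF5
                values = values.astype("S")
            meta.create_dataset(col, data=values)
```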

cdr_bench.io_utils.io.save_optimization_results(df, results, file_name, feature_name)

Save DataFrame and corresponding arrays of coordinates to a single HDF5 file.

Parameters:

    df (pd.DataFrame): The DataFrame to save.
    results (defaultdict): A defaultdict with method names as keys and MethodResult namedtuples as values.
    file_name (str): The name of the file to save the data to.
    feature_name (str): The name of the feature group under which to save the data.

cdr_bench.io_utils.io.read_optimization_results(file_name, feature_name, method_names)

Read the optimization results from an HDF5 file.

Parameters:

    file_name (str, required): The name of the HDF5 file to read the data from.
    feature_name (str, required): The name of the feature to read the data for.
    method_names (List[str], required): List of method names.

Returns:

    Tuple[pd.DataFrame, np.ndarray, Dict[str, Dict[str, Any]]]: A tuple containing:
        - pd.DataFrame: The DataFrame that was saved.
        - np.ndarray: The feature array if it exists, else None.
        - Dict: A dictionary with method names as keys and, for each method, a dictionary with 'metrics' and 'coordinates' entries.
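The save/read pair implies a per-method group holding coordinates and metrics. The round-trip sketch below assumes a layout of `/<feature_name>/<method>/coordinates` plus `/<feature_name>/<method>/metrics/<name>`; this layout is a guess for illustration, not cdr_bench's verified schema:

```python
import h5py
import numpy as np


def save_results_sketch(file_name, feature_name, results):
    """Write one group per method: a coordinates dataset and a metrics subgroup."""
    with h5py.File(file_name, "w") as f:
        for method, res in results.items():
            grp = f.create_group(f"{feature_name}/{method}")
            grp.create_dataset("coordinates", data=res["coordinates"])
            metrics = grp.create_group("metrics")
            for name, value in res["metrics"].items():
                metrics.create_dataset(name, data=value)


def read_results_sketch(file_name, feature_name, method_names):
    """Read back a {method: {'coordinates': ..., 'metrics': {...}}} mapping."""
    out = {}
    with h5py.File(file_name, "r") as f:
        for method in method_names:
            grp = f[f"{feature_name}/{method}"]
            out[method] = {
                "coordinates": grp["coordinates"][...],
                "metrics": {k: grp["metrics"][k][()] for k in grp["metrics"]},
            }
    return out
```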

cdr_bench.io_utils.io.load_fp_array(file_path)

Load the fingerprint array from an HDF5 file.

Parameters:

    file_path (str, required): Path to the HDF5 file.

Returns:

    np.ndarray: Fingerprint array.

cdr_bench.io_utils.io.load_hdf5_data(file_name, method_names)

Load DataFrame and corresponding arrays of coordinates from a single HDF5 file.

Parameters:

    file_name (str): The name of the file to load the data from.
    method_names (list of str): The names corresponding to each array of coordinates.

Returns:

    pd.DataFrame: The loaded DataFrame.
    list of np.ndarray: The loaded arrays of coordinates.

Data Preprocessing

cdr_bench.io_utils.data_preprocessing.prepare_data_for_optimization(data_df, val_data_df, feature_name, scaling)

Prepare data for optimization by scaling and optionally transforming reference data.

Parameters:

    data_df (DataFrame, required): The input DataFrame containing molecular fingerprints.
    val_data_df (Optional[DataFrame], required): The validation DataFrame (if available) containing molecular fingerprints.
    feature_name (str, required): The name of the feature to use.
    scaling (Optional[str], required): The type of feature preprocessing to apply (standardization by default).

Returns:

    Tuple[pd.DataFrame, Optional[pd.DataFrame], np.ndarray, Optional[np.ndarray]]: A tuple containing:
        - processed data DataFrame with constant features removed
        - processed validation DataFrame with constant features removed (if provided)
        - scaled high-dimensional data (X_transformed)
        - scaled reference data (y_transformed, if validation data was provided)
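The core of this preparation step, dropping constant features and standardizing while reusing the training statistics on the validation data, can be sketched with NumPy alone. This is a minimal illustration of the pattern, not cdr_bench's implementation:

```python
import numpy as np


def prepare_sketch(X, y=None):
    """Drop constant columns, then standardize with statistics computed
    on X only, applying the same transform to the reference data y."""
    keep = X.std(axis=0) > 0          # constant features carry no signal
    X = X[:, keep]
    mean, std = X.mean(axis=0), X.std(axis=0)
    X_t = (X - mean) / std
    y_t = None
    if y is not None:
        y_t = (y[:, keep] - mean) / std  # reuse training statistics
    return X_t, y_t
```

Reusing the mean and standard deviation of the training data on the validation set keeps the two sets in the same coordinate system and avoids leaking validation statistics into preprocessing.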

cdr_bench.io_utils.data_preprocessing.remove_duplicates(dataset_name, df, column_name)

Remove duplicate rows from a DataFrame based on a column containing NumPy arrays.

Parameters:

    dataset_name (str, required): Name of the dataset.
    df (DataFrame, required): The input DataFrame.
    column_name (str, required): The name of the column containing NumPy arrays.

Returns:

    pd.DataFrame: A new DataFrame with duplicate rows removed.
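Deduplicating on a column of NumPy arrays needs a workaround, because arrays are unhashable and pandas' drop_duplicates cannot use them directly. One common approach, shown as a sketch rather than cdr_bench's actual code, compares the arrays' byte representations (this assumes the arrays share a dtype and shape):

```python
import numpy as np
import pandas as pd


def remove_duplicates_sketch(df, column_name):
    """Keep the first row for each distinct array in column_name,
    using each array's raw bytes as a hashable surrogate key."""
    keys = df[column_name].map(lambda a: a.tobytes())
    return df.loc[~keys.duplicated()].reset_index(drop=True)
```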

cdr_bench.io_utils.data_preprocessing.get_pca_results(X_transformed, y_transformed, dataset_output_dir, n_components)

Perform PCA on the transformed data and save the PCA results and high-dimensional data to HDF5 files.

Parameters:

    X_transformed (np.ndarray, required): High-dimensional data after scaling.
    y_transformed (Optional[np.ndarray], required): Reference high-dimensional data after scaling.
    dataset_output_dir (str, required): Directory to save the HDF5 files.
    n_components (int, required): Number of principal components to compute.

Returns:

    Tuple[np.ndarray, Optional[np.ndarray], Any]: PCA-transformed data, reference PCA-transformed data (if available), and the fitted PCA model.
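The projection step here, fitting PCA on the primary data and projecting the optional reference data with the same basis, can be sketched with a plain NumPy SVD (cdr_bench likely uses a fitted model object, e.g. from scikit-learn; this dependency-free version only illustrates the math):

```python
import numpy as np


def pca_sketch(X, n_components, y=None):
    """Centre on X, take the top right-singular vectors as components,
    and project the optional reference data y onto the same basis."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    X_pca = (X - mean) @ components.T
    y_pca = (y - mean) @ components.T if y is not None else None
    return X_pca, y_pca, components
```

As with scaling, the reference data is centred with X's mean and projected with X's components, so both datasets land in a shared low-dimensional space.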

cdr_bench.io_utils.data_preprocessing.create_output_directory(output_dir, file_path)

Create an output directory for the dataset based on the output directory and file path.

Parameters:

    output_dir (str, required): The base output directory where the dataset-specific directory will be created.
    file_path (str, required): The path to the dataset file.

Returns:

    str: The path to the created dataset-specific output directory.
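A minimal sketch of this helper with pathlib is shown below; deriving the directory name from the dataset file's stem is an assumption about how the dataset-specific name is chosen:

```python
from pathlib import Path


def create_output_directory(output_dir: str, file_path: str) -> str:
    """Create <output_dir>/<dataset file stem>/ if it does not exist
    and return its path as a string."""
    dataset_dir = Path(output_dir) / Path(file_path).stem
    dataset_dir.mkdir(parents=True, exist_ok=True)
    return str(dataset_dir)
```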