I/O Utilities

HDF5 data handling, configuration loading, and data preprocessing.

Configuration

cdr_bench.io_utils.io.load_config(config_file)

Load the configuration from a TOML file.

cdr_bench.io_utils.io.validate_config(config)

Validate the TOML configuration for required fields, types, and values.

Parameters:

    config (dict, required): The loaded configuration dictionary.

Raises:

    ValueError: If the configuration is invalid.

HDF5 I/O

cdr_bench.io_utils.io.check_hdf5_file_format(file_path)

cdr_bench.io_utils.io.read_features_hdf5_dataframe(file_path)

cdr_bench.io_utils.io.save_dataframe_to_hdf5(df, file_path, non_feature_columns, feature_columns)

Save a DataFrame to an HDF5 file with a hierarchical structure.

Parameters:

    df (DataFrame, required): The DataFrame to save.
    file_path (str, required): The path to the HDF5 file.
    non_feature_columns (Union[List[str], Dict[str, str]], required): List or dictionary of non-feature columns.
    feature_columns (Union[List[str], Dict[str, str]], required): List or dictionary of feature columns.

Returns:

    None
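The exact hierarchical layout cdr_bench writes is not documented here; the following sketch shows one plausible arrangement with h5py, storing the feature columns as a single 2-D dataset and each non-feature column under a metadata group. The group names ("features", "metadata") are assumptions for illustration:

```python
import h5py
import pandas as pd


def save_dataframe_sketch(df, file_path, non_feature_columns, feature_columns):
    """Sketch: feature columns become one 2-D array under /features;
    non-feature columns are stored one dataset each under /metadata."""
    with h5py.File(file_path, "w") as f:
        f.create_dataset("features", data=df[list(feature_columns)].to_numpy())
        meta = f.create_group("metadata")
        for col in non_feature_columns:
            values = df[col].to_numpy()
            if values.dtype == object:  # strings need a byte dtype for HDF5
                values = values.astype("S")
            meta.create_dataset(col, data=values)
```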

cdr_bench.io_utils.io.save_optimization_results(df, results, file_name, feature_name)

Save DataFrame and corresponding arrays of coordinates to a single HDF5 file.

Parameters:

    df (pd.DataFrame): The DataFrame to save.
    results (defaultdict): A defaultdict with method names as keys and MethodResult namedtuples as values.
    file_name (str): The name of the file to save the data to.
    feature_name (str): The name of the feature group under which to save the data.

cdr_bench.io_utils.io.read_optimization_results(file_name, feature_name, method_names)

Read the optimization results from an HDF5 file.

Parameters:

    file_name (str, required): The name of the HDF5 file to read the data from.
    feature_name (str, required): The name of the feature to read the data for.
    method_names (List[str], required): List of method names.

Returns:

    Tuple[pd.DataFrame, np.ndarray, Dict[str, Dict[str, Any]]]: A tuple containing:
        - pd.DataFrame: The DataFrame that was saved.
        - np.ndarray: The feature array if it exists, else None.
        - Dict: A dictionary with method names as keys and, for each method, a dictionary with 'metrics' and 'coordinates' entries.
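The save/read pair implies a per-method group holding coordinates and metrics. The round-trip sketch below assumes a layout of `/<feature_name>/<method>/coordinates` plus `/<feature_name>/<method>/metrics/<name>`; this layout is a guess for illustration, not cdr_bench's verified schema:

```python
import h5py
import numpy as np


def save_results_sketch(file_name, feature_name, results):
    """Write one group per method: a coordinates dataset and a metrics subgroup."""
    with h5py.File(file_name, "w") as f:
        for method, res in results.items():
            grp = f.create_group(f"{feature_name}/{method}")
            grp.create_dataset("coordinates", data=res["coordinates"])
            metrics = grp.create_group("metrics")
            for name, value in res["metrics"].items():
                metrics.create_dataset(name, data=value)


def read_results_sketch(file_name, feature_name, method_names):
    """Read back a {method: {'coordinates': ..., 'metrics': {...}}} mapping."""
    out = {}
    with h5py.File(file_name, "r") as f:
        for method in method_names:
            grp = f[f"{feature_name}/{method}"]
            out[method] = {
                "coordinates": grp["coordinates"][...],
                "metrics": {k: grp["metrics"][k][()] for k in grp["metrics"]},
            }
    return out
```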

cdr_bench.io_utils.io.load_fp_array(file_path)

Load the fingerprint array from an HDF5 file.

Parameters:

    file_path (str, required): Path to the HDF5 file.

Returns:

    np.ndarray: Fingerprint array.

cdr_bench.io_utils.io.load_hdf5_data(file_name, method_names)

Load DataFrame and corresponding arrays of coordinates from a single HDF5 file.

Parameters:

    file_name (str): The name of the file to load the data from.
    method_names (list of str): The names corresponding to each array of coordinates.

Returns:

    pd.DataFrame: The loaded DataFrame.
    list of np.ndarray: The loaded arrays of coordinates.

Data Preprocessing

cdr_bench.io_utils.data_preprocessing.prepare_data_for_optimization(data_df, val_data_df, feature_name, scaling)

Prepare data for optimization by scaling and optionally transforming reference data.

Parameters:

    data_df (DataFrame, required): The input DataFrame containing molecular fingerprints.
    val_data_df (Optional[DataFrame], required): The validation DataFrame (if available) containing molecular fingerprints.
    feature_name (str, required): The name of the feature to use.
    scaling (Optional[str], required): The type of feature preprocessing to apply (standardization by default).

Returns:

    Tuple[pd.DataFrame, Optional[pd.DataFrame], np.ndarray, Optional[np.ndarray]]: A tuple containing:
        - processed data DataFrame with constant features removed
        - processed validation DataFrame with constant features removed (if provided)
        - scaled high-dimensional data (X_transformed)
        - scaled reference data (y_transformed, if validation data was provided)
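The core of this preparation step, dropping constant features and standardizing while reusing the training statistics on the validation data, can be sketched with NumPy alone. This is a minimal illustration of the pattern, not cdr_bench's implementation:

```python
import numpy as np


def prepare_sketch(X, y=None):
    """Drop constant columns, then standardize with statistics computed
    on X only, applying the same transform to the reference data y."""
    keep = X.std(axis=0) > 0          # constant features carry no signal
    X = X[:, keep]
    mean, std = X.mean(axis=0), X.std(axis=0)
    X_t = (X - mean) / std
    y_t = None
    if y is not None:
        y_t = (y[:, keep] - mean) / std  # reuse training statistics
    return X_t, y_t
```

Reusing the mean and standard deviation of the training data on the validation set keeps the two sets in the same coordinate system and avoids leaking validation statistics into preprocessing.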

cdr_bench.io_utils.data_preprocessing.remove_duplicates(dataset_name, df, column_name)

Remove duplicate rows from a DataFrame based on a column containing NumPy arrays.

Parameters:

    dataset_name (str, required): Name of the dataset.
    df (DataFrame, required): The input DataFrame.
    column_name (str, required): The name of the column containing NumPy arrays.

Returns:

    pd.DataFrame: A new DataFrame with duplicate rows removed.
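Deduplicating on a column of NumPy arrays needs a workaround, because arrays are unhashable and pandas' drop_duplicates cannot use them directly. One common approach, shown as a sketch rather than cdr_bench's actual code, compares the arrays' byte representations (this assumes the arrays share a dtype and shape):

```python
import numpy as np
import pandas as pd


def remove_duplicates_sketch(df, column_name):
    """Keep the first row for each distinct array in column_name,
    using each array's raw bytes as a hashable surrogate key."""
    keys = df[column_name].map(lambda a: a.tobytes())
    return df.loc[~keys.duplicated()].reset_index(drop=True)
```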

cdr_bench.io_utils.data_preprocessing.get_pca_results(X_transformed, y_transformed, dataset_output_dir, n_components)

Perform PCA on the transformed data and save the PCA results and high-dimensional data to HDF5 files.

Parameters:

    X_transformed (np.ndarray, required): High-dimensional data after scaling.
    y_transformed (Optional[np.ndarray], required): Reference high-dimensional data after scaling.
    dataset_output_dir (str, required): Directory to save the HDF5 files.
    n_components (int, required): Number of principal components to compute.

Returns:

    Tuple[np.ndarray, Optional[np.ndarray], Any]: PCA-transformed data, reference PCA-transformed data (if available), and the fitted PCA model.
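The projection step here, fitting PCA on the primary data and projecting the optional reference data with the same basis, can be sketched with a plain NumPy SVD (cdr_bench likely uses a fitted model object, e.g. from scikit-learn; this dependency-free version only illustrates the math):

```python
import numpy as np


def pca_sketch(X, n_components, y=None):
    """Centre on X, take the top right-singular vectors as components,
    and project the optional reference data y onto the same basis."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    X_pca = (X - mean) @ components.T
    y_pca = (y - mean) @ components.T if y is not None else None
    return X_pca, y_pca, components
```

As with scaling, the reference data is centred with X's mean and projected with X's components, so both datasets land in a shared low-dimensional space.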

cdr_bench.io_utils.data_preprocessing.create_output_directory(output_dir, file_path)

Create an output directory for the dataset based on the output directory and file path.

Parameters:

    output_dir (str, required): The base output directory where the dataset-specific directory will be created.
    file_path (str, required): The path to the dataset file.

Returns:

    str: The path to the created dataset-specific output directory.
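A minimal sketch of this helper with pathlib is shown below; deriving the directory name from the dataset file's stem is an assumption about how the dataset-specific name is chosen:

```python
from pathlib import Path


def create_output_directory(output_dir: str, file_path: str) -> str:
    """Create <output_dir>/<dataset file stem>/ if it does not exist
    and return its path as a string."""
    dataset_dir = Path(output_dir) / Path(file_path).stem
    dataset_dir.mkdir(parents=True, exist_ok=True)
    return str(dataset_dir)
```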