I/O Utilities¶
HDF5 data handling, configuration loading, and data preprocessing.
Configuration¶
cdr_bench.io_utils.io.load_config(config_file)¶
Load the configuration from a TOML file.
cdr_bench.io_utils.io.validate_config(config)¶
Validate the TOML configuration for required fields, types, and values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | The loaded configuration dictionary. | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If the configuration is invalid. |
HDF5 I/O¶
cdr_bench.io_utils.io.check_hdf5_file_format(file_path)¶
cdr_bench.io_utils.io.read_features_hdf5_dataframe(file_path)¶
cdr_bench.io_utils.io.save_dataframe_to_hdf5(df, file_path, non_feature_columns, feature_columns)¶
Save a DataFrame to an HDF5 file with a hierarchical structure.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to save. | required |
| file_path | str | The path to the HDF5 file. | required |
| non_feature_columns | Union[List[str], Dict[str, str]] | List or dictionary of non-feature columns. | required |
| feature_columns | Union[List[str], Dict[str, str]] | List or dictionary of feature columns. | required |

Returns:

| Type | Description |
|---|---|
| None | None |
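A sketch of what a hierarchical feature/non-feature split can look like on disk, using `h5py` directly. The group names (`features`, `non_features`) are assumptions for illustration; the library's actual layout may differ.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "smiles": ["CCO", "CCN"],                    # non-feature column
    "fp": [np.array([0, 1]), np.array([1, 1])],  # feature column of arrays
})

path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # Assumed layout: one group per column category.
    meta = f.create_group("non_features")
    meta.create_dataset("smiles", data=np.array(df["smiles"], dtype="S"))
    feats = f.create_group("features")
    feats.create_dataset("fp", data=np.stack(df["fp"].to_numpy()))

with h5py.File(path, "r") as f:
    fp = f["features/fp"][...]   # read the stacked feature array back
```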
cdr_bench.io_utils.io.save_optimization_results(df, results, file_name, feature_name)¶
Save a DataFrame and corresponding arrays of coordinates to a single HDF5 file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to save. | required |
| results | defaultdict | A defaultdict with method names as keys and MethodResult namedtuples as values. | required |
| file_name | str | The name of the file to save the data to. | required |
| feature_name | str | The name of the feature under which to save the data. | required |
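A sketch of the `results` structure this function expects. The `MethodResult` field names (`coordinates`, `metrics`) are assumptions inferred from what `read_optimization_results` returns; a plain dict stands in here for the defaultdict the docstring mentions.

```python
from collections import namedtuple

import numpy as np

# Assumed field names -- the library's actual MethodResult may differ.
MethodResult = namedtuple("MethodResult", ["coordinates", "metrics"])

results = {
    "PCA": MethodResult(
        coordinates=np.zeros((10, 2)),       # 2-D embedding of 10 samples
        metrics={"trustworthiness": 0.95},   # per-method quality metrics
    )
}
```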
cdr_bench.io_utils.io.read_optimization_results(file_name, feature_name, method_names)¶
Read the optimization results from an HDF5 file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_name | str | The name of the HDF5 file to read the data from. | required |
| feature_name | str | The name of the feature to read the data for. | required |
| method_names | List[str] | List of method names. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| Tuple | tuple[DataFrame, ndarray, dict[str, dict[str, Any]]] | A tuple containing: the saved DataFrame; the feature array (np.ndarray) if it exists, else None; and a dictionary with method names as keys and, as values, dictionaries with 'metrics' and 'coordinates' entries. |
cdr_bench.io_utils.io.load_fp_array(file_path)¶
Load the fingerprint array from an HDF5 file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the HDF5 file. | required |

Returns:

| Type | Description |
|---|---|
| ndarray | Fingerprint array. |
cdr_bench.io_utils.io.load_hdf5_data(file_name, method_names)¶
Load DataFrame and corresponding arrays of coordinates from a single HDF5 file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file_name | str | The name of the file to load the data from. | required |
| method_names | list of str | The names corresponding to each array of coordinates. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The loaded DataFrame. |
| list of ndarray | The list of loaded arrays of coordinates. |
Data Preprocessing¶
cdr_bench.io_utils.data_preprocessing.prepare_data_for_optimization(data_df, val_data_df, feature_name, scaling)¶
Prepare data for optimization by scaling and optionally transforming reference data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_df | DataFrame | The input data DataFrame containing molecular fingerprints. | required |
| val_data_df | Optional[DataFrame] | The validation data (if available) containing molecular fingerprints. | required |
| feature_name | str | The name of the feature to use. | required |
| scaling | Optional[str] | The type of feature preprocessing to apply (standardization by default). | required |

Returns:

| Type | Description |
|---|---|
| tuple[ndarray, ndarray \| None, ndarray, ndarray \| None] | Tuple[pd.DataFrame, Optional[pd.DataFrame], np.ndarray, Optional[np.ndarray]]: the processed data DataFrame with constant features removed; the processed validation DataFrame with constant features removed (if provided); the scaled high-dimensional data (X_transformed); and the scaled reference data (y_transformed, if validation data was provided). |
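A minimal NumPy sketch of the preprocessing described above: drop constant features, then standardize the rest. The real function also handles DataFrames, a validation set, and alternative scaling modes, all omitted here.

```python
import numpy as np

def prepare(X: np.ndarray) -> np.ndarray:
    keep = X.std(axis=0) > 0               # mask of non-constant features
    X = X[:, keep]                         # drop constant columns
    return (X - X.mean(axis=0)) / X.std(axis=0)  # standardize to zero mean, unit variance

X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, 1.0],
              [3.0, 5.0, 4.0]])            # second column is constant
X_transformed = prepare(X)                 # shape (3, 2): constant column removed
```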
cdr_bench.io_utils.data_preprocessing.remove_duplicates(dataset_name, df, column_name)¶
Remove duplicate rows from a DataFrame based on a column containing NumPy arrays.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_name | str | Name of the dataset. | required |
| df | DataFrame | The input DataFrame. | required |
| column_name | str | The name of the column containing NumPy arrays. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | A new DataFrame with duplicate rows removed. |
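A sketch of why this needs a dedicated helper: NumPy arrays are unhashable, so `drop_duplicates` cannot key on an array column directly. One approach, shown here under the assumption that this mirrors the library's strategy, is to compare arrays via their byte representation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "fp": [np.array([0, 1]), np.array([0, 1]), np.array([1, 1])],  # rows 0 and 1 duplicate
})

keys = df["fp"].map(lambda a: a.tobytes())           # hashable surrogate per array
deduped = df.loc[~keys.duplicated()].reset_index(drop=True)
```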
cdr_bench.io_utils.data_preprocessing.get_pca_results(X_transformed, y_transformed, dataset_output_dir, n_components)¶
Perform PCA on the transformed data and save the PCA results and high-dimensional data to HDF5 files.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X_transformed | ndarray | High-dimensional data after scaling. | required |
| y_transformed | Optional[ndarray] | Reference high-dimensional data after scaling. | required |
| dataset_output_dir | str | Directory to save the HDF5 files. | required |
| n_components | int | Number of principal components to compute. | required |

Returns:

| Type | Description |
|---|---|
| tuple[ndarray, ndarray \| None, Any] | Tuple[np.ndarray, Optional[np.ndarray], Any]: PCA-transformed data, reference PCA-transformed data (if available), and the PCA model. |
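A minimal NumPy sketch of the PCA step itself, via SVD of the centered data. The real function also fits a reusable model and writes results to HDF5 files; both are omitted here.

```python
import numpy as np

def pca_transform(X: np.ndarray, n_components: int) -> np.ndarray:
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                  # project onto top components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
X_pca = pca_transform(X, n_components=2)             # shape (50, 2)
```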
cdr_bench.io_utils.data_preprocessing.create_output_directory(output_dir, file_path)¶
Create an output directory for the dataset based on the output directory and file path.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str | The base output directory where the dataset-specific directory will be created. | required |
| file_path | str | The path to the dataset file. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The path to the created dataset-specific output directory. |
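A sketch of one plausible implementation using `pathlib`. Deriving the directory name from the dataset file's stem is an assumption about the naming scheme, not the library's confirmed behavior.

```python
import tempfile
from pathlib import Path

def create_output_directory(output_dir: str, file_path: str) -> str:
    # Assumed scheme: <output_dir>/<dataset file stem>
    dataset_dir = Path(output_dir) / Path(file_path).stem
    dataset_dir.mkdir(parents=True, exist_ok=True)   # idempotent creation
    return str(dataset_dir)

base = tempfile.mkdtemp()
out = create_output_directory(base, "data/chembl.h5")  # ends with "chembl"
```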