Feature Extraction

Molecular descriptor generation from SMILES strings.

Fingerprints & Preprocessing

`cdr_bench.features.feature_preprocessing.generate_fingerprints(data_df, radius=2, fp_size=1024)`

Generate Morgan fingerprints with RDKit for the SMILES strings in the input DataFrame.

Parameters:

Name	Type	Description	Default
`data_df`	`DataFrame`	The input data DataFrame containing SMILES strings in a column named 'smi'.	required
`radius`	`int`	Radius of the Morgan fingerprints.	`2`
`fp_size`	`int`	Size of the fingerprints.	`1024`

Returns:

Type	Description
`DataFrame`	The input DataFrame with an additional column 'fp' containing the molecular fingerprints.
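To illustrate the kind of computation described above, here is a minimal sketch that builds Morgan fingerprints with RDKit's `rdFingerprintGenerator` and stores them in an 'fp' column. The function name `morgan_fingerprints_sketch`, the use of numpy arrays as the stored fingerprint type, and the `None` placeholder for unparsable SMILES are all assumptions made for illustration; the actual `generate_fingerprints` implementation may differ.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_fingerprints_sketch(data_df, radius=2, fp_size=1024):
    """Hypothetical helper: add an 'fp' column of Morgan bit vectors."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fingerprints = []
    for smi in data_df["smi"]:
        mol = Chem.MolFromSmiles(smi)  # returns None if the SMILES cannot be parsed
        # Assumption: unparsable SMILES get a None placeholder.
        fingerprints.append(
            generator.GetFingerprintAsNumPy(mol) if mol is not None else None
        )
    out = data_df.copy()
    out["fp"] = pd.Series(fingerprints, index=out.index, dtype=object)
    return out


example_df = pd.DataFrame({"smi": ["CCO", "c1ccccc1"]})
example_fps = morgan_fingerprints_sketch(example_df, radius=2, fp_size=1024)
```

Each entry of `example_fps["fp"]` is a numpy array of length `fp_size`, so the column can be stacked into a feature matrix with `np.vstack` when every SMILES parses.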

`cdr_bench.features.feature_preprocessing.get_features(file_path, use_fingerprints=True, radius=2, fp_size=1024)`

Get features by either generating fingerprints or loading from a file.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the data file.	required
`use_fingerprints`	`bool`	Whether to generate and use molecular fingerprints.	`True`
`radius`	`int`	Radius of the Morgan fingerprints.	`2`
`fp_size`	`int`	Size of the Morgan fingerprints.	`1024`

Returns:

Type	Description
`DataFrame`	DataFrame with features.

`cdr_bench.features.feature_preprocessing.standardize_features(features, return_standardizer=False)`

Standardize the feature vectors by removing the mean and scaling to unit variance.

Parameters:

Name	Type	Description	Default
`features`	`ndarray`	The feature vectors to standardize.	required
`return_standardizer`	`bool`	If True, returns the scaler used for standardization along with the standardized features.	`False`

Returns:

Type	Description
`ndarray \| tuple[ndarray, StandardScaler]`	The standardized features if `return_standardizer` is False; otherwise a tuple of the standardized features and the fitted scaler.
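The return type indicates that scikit-learn's `StandardScaler` is involved. As a rough illustration of what that kind of standardization computes, the numpy-only sketch below subtracts the per-column mean and divides by the per-column population standard deviation, leaving zero-variance columns unscaled as `StandardScaler` does. The name `standardize_sketch` and the `return_stats` flag are hypothetical and not part of `cdr_bench`.

```python
import numpy as np


def standardize_sketch(features, return_stats=False):
    """Hypothetical helper: zero-mean, unit-variance scaling per column."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)  # population standard deviation (ddof=0)
    # Constant columns have std == 0; divide by 1 instead to avoid NaN.
    scale = np.where(std == 0.0, 1.0, std)
    standardized = (features - mean) / scale
    if return_stats:
        return standardized, (mean, scale)
    return standardized


X = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
Z, (col_mean, col_scale) = standardize_sketch(X, return_stats=True)
```

Keeping the mean and scale around plays the same role as returning the fitted scaler: the identical transform can later be applied to new data.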

`cdr_bench.features.feature_preprocessing.find_nonconstant_features(data)`

Identify the columns of a numpy array whose values are not constant (non-zero variance).

Parameters:

Name	Type	Description	Default
`data`	`ndarray`	The input data array.	required

Returns:

Type	Description
`ndarray`	Indices of the columns with non-zero variance.
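One way to find such columns, shown here only as a sketch, is to compare every row against the first row and keep the columns where at least one value differs. The function name `nonconstant_columns_sketch` is hypothetical, and the library may use a variance-based test instead.

```python
import numpy as np


def nonconstant_columns_sketch(data):
    """Hypothetical helper: indices of columns that contain more than one value."""
    data = np.asarray(data)
    # A column is non-constant if any row differs from the first row in it.
    varies = (data != data[0]).any(axis=0)
    return np.flatnonzero(varies)


matrix = np.array(
    [
        [1, 0, 3],
        [1, 1, 3],
        [1, 0, 4],
    ]
)
kept = nonconstant_columns_sketch(matrix)
```

Here the first column is identical in every row, so only the indices of the second and third columns are returned.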

`cdr_bench.features.feature_preprocessing.remove_constant_features(data_df, indices, feature_name)`

Remove constant columns from the feature vectors stored in a DataFrame column (for example 'fp').

Parameters:

Name	Type	Description	Default
`data_df`	`DataFrame`	The input data DataFrame.	required
`indices`	`ndarray`	Indices of the columns with non-zero variance.	required
`feature_name`	`str`	Name of the column that holds the features.	required

Returns:

Type	Description
`DataFrame`	DataFrame whose feature vectors contain only the non-constant columns.
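The sketch below shows how the two steps could fit together: stack the per-row vectors into a matrix, find the non-constant column indices, then slice every vector down to those indices. It assumes each cell of the feature column holds a 1-D numpy array of equal length; `keep_columns_sketch` is a hypothetical name, not the library function.

```python
import numpy as np
import pandas as pd


def keep_columns_sketch(data_df, indices, feature_name):
    """Hypothetical helper: keep only the given positions of each feature vector."""
    out = data_df.copy()
    sliced = [np.asarray(vec)[indices] for vec in out[feature_name]]
    out[feature_name] = pd.Series(sliced, index=out.index, dtype=object)
    return out


df = pd.DataFrame(
    {
        "smi": ["CCO", "CCN"],
        "fp": [np.array([1, 0, 1]), np.array([1, 1, 1])],
    }
)
stacked = np.vstack(df["fp"].to_list())
# Same row-comparison idea as a non-constant column search.
nonconstant = np.flatnonzero((stacked != stacked[0]).any(axis=0))
reduced = keep_columns_sketch(df, nonconstant, "fp")
```

Only the middle position differs between the two vectors, so each reduced vector has a single element.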

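Physicochemical Descriptors

`cdr_bench.features.physchem.calculate_descriptors(mol)`

Calculate six physicochemical properties: hydrogen-bond donors (HBD), hydrogen-bond acceptors (HBA), LogP, molecular weight (MW), topological polar surface area (TPSA), and rotatable bonds (RTB).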

Parameters:

Name	Type	Description	Default
`mol`	`Mol`	RDKit molecule object.	required

Returns:

Type	Description
`list[float \| None]`	List of calculated properties or None for invalid molecules.
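For orientation, the six properties can be computed with functions from RDKit's `Descriptors` module, as in the sketch below. The function name `physchem_sketch`, the ordering of the values, and the choice to return six `None` entries for an invalid molecule are assumptions; the source only states that `None` is used for invalid molecules.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def physchem_sketch(mol):
    """Hypothetical helper: [HBD, HBA, LogP, MW, TPSA, RTB] for an RDKit Mol."""
    if mol is None:
        # Assumption: an unparsable molecule yields one None per property.
        return [None] * 6
    return [
        float(Descriptors.NumHDonors(mol)),         # HBD
        float(Descriptors.NumHAcceptors(mol)),      # HBA
        Descriptors.MolLogP(mol),                   # Crippen LogP estimate
        Descriptors.MolWt(mol),                     # average molecular weight
        Descriptors.TPSA(mol),                      # topological polar surface area
        float(Descriptors.NumRotatableBonds(mol)),  # RTB
    ]


ethanol_props = physchem_sketch(Chem.MolFromSmiles("CCO"))
```

`Chem.MolFromSmiles` returns `None` for a SMILES string it cannot parse, which is why the helper checks for `None` before calling any descriptor function.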

ChemDist GNN Embeddings

`cdr_bench.features.chemdist_features.load_model(config)`

Load and initialize the model with the parameters specified in the configuration.

Parameters:

Name	Type	Description	Default
`config`	`Dict[str, Any]`	Dictionary with model configuration, including model path and parameters.	required

Returns:

Type	Description
`Module`	The initialized and loaded model, ready for evaluation.

`cdr_bench.features.chemdist_features.generate_embeddings(df, model, node_featurizer, edge_featurizer)`

Generate graph-based embeddings for the molecules in the DataFrame, store them in a new column, and remove rows with NaN embeddings.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame containing a 'smi' column with SMILES strings.	required
`model`	`Module`	Pre-trained model to generate embeddings.	required
`node_featurizer`	`CanonicalAtomFeaturizer`	Featurizer for atoms in the molecular graph.	required
`edge_featurizer`	`CanonicalBondFeaturizer`	Featurizer for bonds in the molecular graph.	required

Returns:

Type	Description
`DataFrame`	Updated DataFrame with a new 'embed' column containing the embeddings; rows with NaN values in 'embed' are removed.
