Feature Extraction
Molecular descriptor generation from SMILES strings.
Fingerprints & Preprocessing
cdr_bench.features.feature_preprocessing.generate_fingerprints(data_df, radius=2, fp_size=1024)
Generate molecular fingerprints using RDKit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_df` | `DataFrame` | The input DataFrame containing SMILES strings in a column named 'smi'. | required |
| `radius` | `int` | The radius for the Morgan fingerprints. | `2` |
| `fp_size` | `int` | The size of the fingerprints. | `1024` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The DataFrame with an additional column 'fp' containing the molecular fingerprints. |
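
For orientation, the sketch below shows one way to compute a Morgan bit fingerprint for a single SMILES string with RDKit, using the same defaults as above (radius 2, size 1024). It is not the implementation of generate_fingerprints: the helper name `morgan_bits` and the choice to return None for an unparsable SMILES are assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def morgan_bits(smiles, radius=2, fp_size=1024):
    """Hypothetical helper: Morgan bit fingerprint as a NumPy array, or None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        return None
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    return generator.GetFingerprintAsNumPy(mol)


fp = morgan_bits("CCO")        # ethanol
invalid = morgan_bits("C1CC")  # unclosed ring, cannot be parsed
```

Here `fp` is a one-dimensional array with `fp_size` entries, and `invalid` is None. How the documented function treats invalid SMILES is not stated in this reference.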
cdr_bench.features.feature_preprocessing.get_features(file_path, use_fingerprints=True, radius=2, fp_size=1024)
Get features by either generating fingerprints or loading from a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path to the data file. | required |
| `use_fingerprints` | `bool` | Whether to generate and use molecular fingerprints. Default is True. | `True` |
| `radius` | `int` | Radius for Morgan fingerprints. Default is 2. | `2` |
| `fp_size` | `int` | Size of Morgan fingerprints. Default is 1024. | `1024` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with features. |
cdr_bench.features.feature_preprocessing.standardize_features(features, return_standardizer=False)
Standardize the feature vectors by removing the mean and scaling to unit variance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `features` | `ndarray` | The feature vectors to standardize. | required |
| `return_standardizer` | `bool` | If True, returns the scaler used for standardization along with the standardized features. | `False` |
Returns:
| Type | Description |
|---|---|
| `ndarray \| tuple[ndarray, StandardScaler]` | If `return_standardizer` is False, returns the standardized features. If `return_standardizer` is True, returns a tuple of standardized features and the scaler. |
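
The return type indicates that scikit-learn's StandardScaler does the actual work. To make the arithmetic concrete, here is a plain-Python sketch of per-column standardization. The function name `standardize_columns` is hypothetical, and the handling of constant columns (dividing by 1 instead of 0) mirrors StandardScaler's behaviour rather than anything stated in this reference.

```python
import math


def standardize_columns(rows):
    """Hypothetical helper: z-score each column of a list of equal-length rows."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        # Population variance (divide by n), as StandardScaler does
        variance = sum((row[j] - means[j]) ** 2 for row in rows) / n_rows
        std = math.sqrt(variance)
        stds.append(std if std > 0 else 1.0)  # avoid dividing by zero
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in rows]
    return scaled, means, stds


features = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
scaled, means, stds = standardize_columns(features)
```

After this runs, each non-constant column of `scaled` has zero mean and unit variance, and the constant third column is mapped to zeros.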
cdr_bench.features.feature_preprocessing.find_nonconstant_features(data)
Identify columns with non-constant values (non-zero variance) in a NumPy array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `ndarray` | The input data array. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Indices of the columns whose values are not constant (non-zero variance). |
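
As an illustration of what "non-constant" means here, the following plain-Python sketch returns the indices of columns that contain more than one distinct value. The helper name `nonconstant_column_indices` is hypothetical, and the documented function operates on a NumPy array rather than nested lists.

```python
def nonconstant_column_indices(rows):
    """Hypothetical helper: indices of columns that are not constant across rows."""
    first = rows[0]
    return [
        j
        for j in range(len(first))
        if any(row[j] != first[j] for row in rows)  # some row differs from the first
    ]


data = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
kept = nonconstant_column_indices(data)
```

Columns 0 and 2 hold a single value in every row, so `kept` contains only the indices 1 and 3.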
cdr_bench.features.feature_preprocessing.remove_constant_features(data_df, indices, feature_name)
Remove constant (zero-variance) features from a DataFrame's feature column, which is named by feature_name (for example 'fp').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_df` | `DataFrame` | The input DataFrame. | required |
| `indices` | `ndarray` | Indices of the non-constant (non-zero variance) features to keep. | required |
| `feature_name` | `str` | Name of the column that holds the features. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame whose feature column contains only the non-constant features. |
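
The interplay of `indices` and `feature_name` can be sketched without pandas by treating each row as a dictionary. This is an assumption-laden illustration: the helper `keep_feature_indices` and the list-of-dicts layout are hypothetical stand-ins for the DataFrame the documented function works on.

```python
def keep_feature_indices(records, indices, feature_name):
    """Hypothetical helper: keep only the given positions of each feature vector."""
    return [
        {**record, feature_name: [record[feature_name][j] for j in indices]}
        for record in records  # a new dict per row, so the input is left unchanged
    ]


records = [
    {"smi": "CCO", "fp": [0, 1, 1, 0]},
    {"smi": "CCN", "fp": [0, 0, 1, 1]},
]
filtered = keep_feature_indices(records, [1, 3], "fp")
```

Each vector in `filtered` now has two entries, taken from positions 1 and 3 of the original, while the other fields are carried over unchanged.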
Physicochemical Descriptors
cdr_bench.features.physchem.calculate_descriptors(mol)
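Calculate six physicochemical properties: hydrogen-bond donors (HBD), hydrogen-bond acceptors (HBA), LogP, molecular weight (MW), topological polar surface area (TPSA), and rotatable bonds (RTB).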
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mol` | `Mol` | RDKit molecule object. | required |
Returns:
| Type | Description |
|---|---|
| `list[float \| None]` | List of calculated properties or None for invalid molecules. |
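
The six properties map naturally onto functions in RDKit's Descriptors module. The sketch below is one plausible way to compute them, not the library's implementation: the helper name `six_descriptors`, the output order, and returning a list of six None values for an invalid molecule are all assumptions.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def six_descriptors(mol):
    """Hypothetical helper: [HBD, HBA, LogP, MW, TPSA, RTB] for an RDKit molecule."""
    if mol is None:  # Chem.MolFromSmiles returns None for unparsable input
        return [None] * 6
    return [
        float(Descriptors.NumHDonors(mol)),         # hydrogen-bond donors
        float(Descriptors.NumHAcceptors(mol)),      # hydrogen-bond acceptors
        Descriptors.MolLogP(mol),                   # Crippen LogP estimate
        Descriptors.MolWt(mol),                     # average molecular weight
        Descriptors.TPSA(mol),                      # topological polar surface area
        float(Descriptors.NumRotatableBonds(mol)),  # rotatable bonds
    ]


values = six_descriptors(Chem.MolFromSmiles("CCO"))  # ethanol
```

For ethanol this yields one donor, one acceptor, and a molecular weight of roughly 46.07.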
ChemDist GNN Embeddings
cdr_bench.features.chemdist_features.load_model(config)
Loads and initializes the model with the specified parameters from the configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `Dict[str, Any]` | Dictionary with model configuration, including model path and parameters. | required |
Returns:
| Type | Description |
|---|---|
| `Module` | The initialized and loaded model (a `torch.nn.Module`), ready for evaluation. |
cdr_bench.features.chemdist_features.generate_embeddings(df, model, node_featurizer, edge_featurizer)
Generates graph-based embeddings for molecules in the DataFrame, stores them in a new column, and removes rows with NaN embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame containing a 'smi' column with SMILES strings. | required |
| `model` | `Module` | Pre-trained model to generate embeddings. | required |
| `node_featurizer` | `CanonicalAtomFeaturizer` | Featurizer for atoms in the molecular graph. | required |
| `edge_featurizer` | `CanonicalBondFeaturizer` | Featurizer for bonds in the molecular graph. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Updated DataFrame with a new 'embed' column containing embeddings, and rows with NaN values in the 'embed' column removed. |