Feature Extraction

Molecular descriptor generation from SMILES strings.

Fingerprints & Preprocessing

cdr_bench.features.feature_preprocessing.generate_fingerprints(data_df, radius=2, fp_size=1024)

Generate molecular fingerprints using RDKit.

Parameters:

Name Type Description Default
data_df DataFrame

The input data DataFrame containing SMILES strings in a column named 'smi'.

required
radius int

The radius for the Morgan fingerprints. Default is 2.

2
fp_size int

The size of the fingerprints. Default is 1024.

1024

Returns:

Type Description
DataFrame

pd.DataFrame: The DataFrame with an additional column 'fp' containing the molecular fingerprints.
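As a rough illustration of what this step involves, the sketch below builds an 'fp' column of Morgan bit vectors directly with RDKit. The helper name add_morgan_fingerprints is hypothetical, and storing None for unparsable SMILES is an assumption; the actual implementation in cdr_bench may differ in both the RDKit calls it uses and the way it stores the result.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def add_morgan_fingerprints(data_df, radius=2, fp_size=1024):
    """Hypothetical stand-in: add an 'fp' column of Morgan bit vectors."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fingerprints = []
    for smiles in data_df["smi"]:
        mol = Chem.MolFromSmiles(smiles)
        # Assumption: unparsable SMILES are stored as None rather than raising.
        fingerprints.append(None if mol is None else generator.GetFingerprintAsNumPy(mol))
    result = data_df.copy()
    result["fp"] = fingerprints
    return result


df = pd.DataFrame({"smi": ["CCO", "c1ccccc1"]})
df = add_morgan_fingerprints(df, radius=2, fp_size=1024)
print(df["fp"].iloc[0].shape)
```

Each entry of the 'fp' column is a NumPy array of length fp_size, so the column can be stacked into a 2D feature matrix for downstream steps.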

cdr_bench.features.feature_preprocessing.get_features(file_path, use_fingerprints=True, radius=2, fp_size=1024)

Get features by either generating fingerprints or loading from a file.

Parameters:

Name Type Description Default
file_path str

Path to the data file.

required
use_fingerprints bool

Whether to generate and use molecular fingerprints. Default is True.

True
radius int

Radius for Morgan fingerprints. Default is 2.

2
fp_size int

Size of Morgan fingerprints. Default is 1024.

1024

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with features.

cdr_bench.features.feature_preprocessing.standardize_features(features, return_standardizer=False)

Standardize the feature vectors by removing the mean and scaling to unit variance.

Parameters:

Name Type Description Default
features ndarray

The feature vectors to standardize.

required
return_standardizer bool

If True, returns the scaler used for standardization along with the standardized features.

False

Returns:

Type Description
ndarray | tuple[ndarray, StandardScaler]

Union[np.ndarray, Tuple[np.ndarray, StandardScaler]]: If return_standardizer is False, returns the standardized features. If return_standardizer is True, returns a tuple of standardized features and the scaler.
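The transformation described above can be sketched in plain NumPy. This is a minimal stand-in, not the library's code: the helper standardize and its return_stats flag are hypothetical, and the (mean, scale) tuple plays the role that the fitted StandardScaler plays in the real function. It follows the conventions of scikit-learn's StandardScaler: population standard deviation, and constant columns left unscaled.

```python
import numpy as np


def standardize(features, return_stats=False):
    """Hypothetical sketch: zero-mean, unit-variance scaling per column."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)  # population standard deviation (ddof=0)
    # Constant columns have zero std; divide by 1 instead to avoid NaNs.
    scale = np.where(std == 0.0, 1.0, std)
    scaled = (features - mean) / scale
    if return_stats:
        return scaled, (mean, scale)
    return scaled


X = np.array([[1.0, 2.0, 5.0],
              [3.0, 2.0, 7.0]])
X_scaled, (mean, scale) = standardize(X, return_stats=True)

# Reuse the fitted statistics on unseen data, as one would reuse a scaler.
X_new_scaled = (np.array([[2.0, 2.0, 6.0]]) - mean) / scale
```

Returning the fitted statistics (or, in the real function, the scaler) lets new data be transformed with the parameters learned from the original data.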

cdr_bench.features.feature_preprocessing.find_nonconstant_features(data)

Identify the columns of a NumPy array whose values are not constant, that is, columns with non-zero variance.

Parameters:

Name Type Description Default
data ndarray

The input data array.

required

Returns:

Type Description
ndarray

np.ndarray: Indices of the columns whose values are not constant (non-zero variance).
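One way to find such columns is to compare every row against the first one. The helper below, nonconstant_columns, is a hypothetical sketch of the idea; the library may instead test the variance or standard deviation of each column.

```python
import numpy as np


def nonconstant_columns(data):
    """Hypothetical sketch: indices of columns holding more than one value."""
    data = np.asarray(data)
    # A column is non-constant if any row differs from the first row.
    varies = (data != data[0]).any(axis=0)
    return np.where(varies)[0]


data = np.array([[0, 1, 1],
                 [0, 0, 1],
                 [0, 1, 1]])
indices = nonconstant_columns(data)
```

Here only the middle column changes between rows, so it is the only index returned.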

cdr_bench.features.feature_preprocessing.remove_constant_features(data_df, indices, feature_name)

Remove constant (zero-variance) features from the feature column of a DataFrame (for example 'fp'), keeping only the entries at the given indices.

Parameters:

Name Type Description Default
data_df DataFrame

The input data DataFrame.

required
indices ndarray

Indices of the non-constant features to keep.

required
feature_name str

Name of the column that holds the feature vectors.

required

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame whose feature column contains only the non-constant features.
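The filtering step can be pictured as indexing every stored feature vector with the same set of indices. The sketch below assumes that each cell of the feature column holds a 1D array; the helper keep_feature_indices is hypothetical and only illustrates the idea.

```python
import numpy as np
import pandas as pd


def keep_feature_indices(data_df, indices, feature_name):
    """Hypothetical sketch: keep only the selected entries of each vector."""
    result = data_df.copy()
    # Assumption: every cell in the feature column is a 1D array-like.
    result[feature_name] = [np.asarray(vec)[indices] for vec in result[feature_name]]
    return result


df = pd.DataFrame({"smi": ["CCO", "CCN"]})
df["fp"] = [np.array([0, 1, 1]), np.array([0, 0, 1])]

# Index 1 is the only position where the two vectors differ.
filtered = keep_feature_indices(df, np.array([1]), "fp")
stacked = np.vstack(list(filtered["fp"]))
```

Working on a copy leaves the input DataFrame untouched, which is a design choice of this sketch rather than documented behavior of the library.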

Physicochemical Descriptors

cdr_bench.features.physchem.calculate_descriptors(mol)

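Calculate six physicochemical properties: hydrogen-bond donors (HBD), hydrogen-bond acceptors (HBA), LogP, molecular weight (MW), topological polar surface area (TPSA), and rotatable bonds (RTB).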

Parameters:

Name Type Description Default
mol Mol

RDKit molecule object.

required

Returns:

Type Description
list[float | None]

List[Optional[float]]: List of calculated properties or None for invalid molecules.
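For orientation, the six properties can be computed with standard RDKit descriptor functions, as in the sketch below. The helper six_descriptors is hypothetical. The choice of RDKit functions and the handling of an invalid molecule (a list of six None values) are assumptions; the ordering follows the list given above.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors


def six_descriptors(mol):
    """Hypothetical sketch: [HBD, HBA, LogP, MW, TPSA, RTB] for one molecule."""
    if mol is None:
        # Assumption: an invalid molecule yields None for every property.
        return [None] * 6
    return [
        float(Lipinski.NumHDonors(mol)),                     # HBD
        float(Lipinski.NumHAcceptors(mol)),                  # HBA
        Crippen.MolLogP(mol),                                # LogP (Crippen estimate)
        Descriptors.MolWt(mol),                              # MW
        rdMolDescriptors.CalcTPSA(mol),                      # TPSA
        float(rdMolDescriptors.CalcNumRotatableBonds(mol)),  # RTB
    ]


values = six_descriptors(Chem.MolFromSmiles("CCO"))  # ethanol
missing = six_descriptors(None)
```

Since Chem.MolFromSmiles returns None for SMILES it cannot parse, checking for None before computing descriptors avoids an exception on bad input.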

ChemDist GNN Embeddings

cdr_bench.features.chemdist_features.load_model(config)

Loads and initializes the model with the specified parameters from the configuration.

Parameters:

Name Type Description Default
config Dict[str, Any]

Dictionary with model configuration, including model path and parameters.

required

Returns:

Type Description
Module

torch.nn.Module: The initialized and loaded model, ready for evaluation.

cdr_bench.features.chemdist_features.generate_embeddings(df, model, node_featurizer, edge_featurizer)

Generates graph-based embeddings for molecules in the DataFrame, stores them in a new column, and removes rows with NaN embeddings.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing a 'smi' column with SMILES strings.

required
model Module

Pre-trained model to generate embeddings.

required
node_featurizer CanonicalAtomFeaturizer

Featurizer for atoms in the molecular graph.

required
edge_featurizer CanonicalBondFeaturizer

Featurizer for bonds in the molecular graph.

required

Returns:

Type Description
DataFrame

pd.DataFrame: Updated DataFrame with a new 'embed' column containing embeddings, and rows with NaN values in the 'embed' column removed.