Scoring & Metrics
Evaluation metrics for dimensionality reduction quality.
DRScorer
cdr_bench.scoring.scoring.DRScorer
A class to score dimensionality reduction models.
Attributes:
| Name | Type | Description |
|---|---|---|
| `estimator` | `BaseEstimator` | Dimensionality reduction model. |
| `scoring_params` | `ScoringParams` | Parameters for scoring. |
__init__(estimator, scoring_params)
overlap_scoring(X, y=None)
Calculates the overlap percentage between the nearest neighbors in the reduced dimension space and the original high-dimensional space.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | Low-dimensional coordinates. | required |
| `y` | `ndarray` | Target values (not used in this function). | `None` |
Returns:
| Type | Description |
|---|---|
| `float` | Overlap percentage score. |
overlap_scoring_list(X, y=None)
Calculates the per-point overlap fraction between nearest neighbors in the reduced dimension space and the original high-dimensional space.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `X` | `ndarray` | Low-dimensional coordinates. | required |
| `y` | `ndarray` | Target values (not used in this function). | `None` |
Returns:
| Type | Description |
|---|---|
| `list[float]` | Per-point overlap fractions (0 to 1). |
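The neighbor-overlap idea behind both methods can be illustrated with a small NumPy sketch. This is not the `DRScorer` implementation: the helper names `knn_indices` and `overlap_per_point` are hypothetical, Euclidean distance is assumed in both spaces, and the "embedding" is simply the first two input columns.

```python
import numpy as np


def knn_indices(points: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row (self excluded)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def overlap_per_point(high: np.ndarray, low: np.ndarray, k: int) -> list[float]:
    """Fraction of each point's k neighbors shared between the two spaces."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    return [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)]


rng = np.random.default_rng(0)
high = rng.normal(size=(50, 10))
low = high[:, :2]  # stand-in for a real 2D embedding (assumption)

per_point = overlap_per_point(high, low, k=5)
mean_overlap_percent = 100.0 * float(np.mean(per_point))
```

Averaging the per-point fractions and scaling by 100 gives a single percentage, which is one plausible relationship between the list-valued and scalar scores.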
get_scoring_function(scoring_type)
Returns the appropriate scoring function based on the type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `scoring_type` | `str` | The type of scoring function to use. | required |
Returns:
| Type | Description |
|---|---|
| `Callable[[ndarray, ndarray \| None], float]` | A scoring function. |
Distance Functions
cdr_bench.scoring.scoring.calculate_distance_matrix(data, metric)
Calculate the distance matrix for the given data using the specified metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `ndarray` | The input data matrix. | required |
| `metric` | `str` | The distance metric to use (`'euclidean'` or `'tanimoto'`). | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | The calculated distance matrix. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If an unsupported metric is provided. |
cdr_bench.scoring.scoring.calculate_distance_2_matrices(data_1, data_2, metric)
Calculate the distance matrix between two data matrices using the specified metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_1` | `ndarray` | The first input data matrix. | required |
| `data_2` | `ndarray` | The second input data matrix. | required |
| `metric` | `str` | The distance metric to use (`'euclidean'` or `'tanimoto'`). | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | The calculated distance matrix. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If an unsupported metric is provided. |
cdr_bench.scoring.scoring.euclidean_distance_square_numba(x1, x2)
Calculate the squared Euclidean distance between each pair of vectors in two arrays using Numba for optimization.
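As an illustration of what a pairwise squared Euclidean distance computes, here is a plain NumPy sketch. It does not use Numba and is not the library's code; the function name `squared_euclidean` is hypothetical.

```python
import numpy as np


def squared_euclidean(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Squared distance between every row of x1 and every row of x2."""
    sq1 = np.sum(x1 * x1, axis=1)[:, None]  # ||a||^2 as a column
    sq2 = np.sum(x2 * x2, axis=1)[None, :]  # ||b||^2 as a row
    d2 = sq1 + sq2 - 2.0 * (x1 @ x2.T)      # ||a - b||^2 expanded
    return np.maximum(d2, 0.0)              # clip tiny negative round-off


a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])
d2 = squared_euclidean(a, b)  # shape (2, 3)
```

Element (i, j) of the result is the squared distance between row i of the first array and row j of the second.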
cdr_bench.scoring.scoring.tanimoto_int_similarity_matrix_numba(v_a, v_b)
Implement the Tanimoto similarity measure for integer matrices, comparing each vector in v_a against each in v_b.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `v_a` | `ndarray` | Matrix where each row represents a vector a. | required |
| `v_b` | `ndarray` | Matrix where each row represents a vector b. | required |
Returns:
| Type | Description |
|---|---|
| `ndarray` | Matrix of computed similarity scores, where element (i, j) is the similarity between row i of v_a and row j of v_b. |
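A minimal NumPy sketch of a Tanimoto similarity for count vectors follows. It assumes the dot-product form, a·b / (|a|² + |b|² − a·b); whether the library uses this form or a min/max variant is not stated here. The function name and the convention of returning 0 for two all-zero vectors are also assumptions.

```python
import numpy as np


def tanimoto_similarity(v_a: np.ndarray, v_b: np.ndarray) -> np.ndarray:
    """Tanimoto similarity between each row of v_a and each row of v_b."""
    v_a = v_a.astype(np.float64)
    v_b = v_b.astype(np.float64)
    dot = v_a @ v_b.T
    norm_a = np.sum(v_a * v_a, axis=1)[:, None]
    norm_b = np.sum(v_b * v_b, axis=1)[None, :]
    denom = norm_a + norm_b - dot
    sim = np.zeros_like(dot)
    # Assumed convention: similarity is 0 where both vectors are all zeros.
    np.divide(dot, denom, out=sim, where=denom > 0)
    return sim


fp_a = np.array([[1, 1, 0, 0], [2, 0, 1, 0]])
fp_b = np.array([[1, 1, 0, 0], [0, 0, 0, 3]])
sim = tanimoto_similarity(fp_a, fp_b)
```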
Co-ranking Analysis
cdr_bench.scoring.scoring.coranking_matrix(distances_high, distances_low, k=None, use_numba=True)
Compute a co-ranking matrix to compare the preservation of neighborhood relations between two different representations of data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `distances_high` | `ndarray` | Distance matrix in the high-dimensional space. | required |
| `distances_low` | `ndarray` | Distance matrix in the low-dimensional space. | required |
| `k` | `int` | The neighborhood size to consider. Defaults to the number of samples if None. | `None` |
| `use_numba` | `bool` | Flag to use the Numba-optimized function for large datasets. | `True` |
Returns:
| Type | Description |
|---|---|
| `ndarray` | A k x k co-ranking matrix. |
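For intuition, a co-ranking matrix counts, for every ordered pair of points, the combination of its neighbor rank in the high-dimensional space and its rank in the low-dimensional space. The sketch below follows that textbook definition in plain NumPy. It always builds the full (n − 1) × (n − 1) matrix, ignores the `k` and `use_numba` options, and uses hypothetical helper names, so it may differ from the library's output shape.

```python
import numpy as np


def rank_matrix(dist: np.ndarray) -> np.ndarray:
    """ranks[i, j] = position of j in i's sorted neighbor list (0 = i itself)."""
    d = np.array(dist, dtype=float)
    np.fill_diagonal(d, -np.inf)  # force each point to rank itself first
    order = np.argsort(d, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(d.shape[0])[:, None]
    ranks[rows, order] = np.arange(d.shape[1])[None, :]
    return ranks


def coranking(dist_high: np.ndarray, dist_low: np.ndarray) -> np.ndarray:
    """Q[k-1, l-1] = number of pairs with high-dim rank k and low-dim rank l."""
    n = dist_high.shape[0]
    r_high = rank_matrix(dist_high)
    r_low = rank_matrix(dist_low)
    off_diag = ~np.eye(n, dtype=bool)  # skip each point paired with itself
    q = np.zeros((n - 1, n - 1), dtype=np.int64)
    np.add.at(q, (r_high[off_diag] - 1, r_low[off_diag] - 1), 1)
    return q


x = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(x[:, None] - x[None, :])
q_perfect = coranking(dist, dist)  # identical rankings: all mass on the diagonal
```

When both spaces rank neighbors identically, every count lands on the diagonal; off-diagonal mass indicates rank changes introduced by the embedding.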
cdr_bench.scoring.scoring.coranking_measures(Q, k_neighbors=None)
Analyze the co-ranking matrix to compute various metrics such as AUC, Qlocal, and Qglobal.
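The sketch below shows one common way such measures are derived from a full co-ranking matrix: Q_NX(K) is the normalized mass in the upper-left K × K block, the local continuity meta-criterion (LCMC) subtracts the chance level K / (n − 1), and its maximum splits the curve into a local and a global part. The function names, the exact split, and the omission of the AUC are assumptions of this illustration, not a description of `coranking_measures`.

```python
import numpy as np


def q_nx(q: np.ndarray) -> np.ndarray:
    """Q_NX(K) for K = 1 .. n-1 from an (n-1) x (n-1) co-ranking matrix."""
    n = q.shape[0] + 1
    ks = np.arange(1, n)
    # Double cumulative sum: entry (K-1, K-1) is the total of the K x K block.
    block = np.cumsum(np.cumsum(q, axis=0), axis=1)
    return np.diag(block) / (ks * n)


def local_global(q: np.ndarray) -> tuple[int, float, float]:
    """Return (K_max, Q_local, Q_global) using the LCMC maximum as the split."""
    n = q.shape[0] + 1
    qnx = q_nx(q)
    ks = np.arange(1, n)
    lcmc = qnx - ks / (n - 1)
    k_max = int(np.argmax(lcmc)) + 1
    q_local = float(np.mean(qnx[:k_max]))
    # If K_max is the last K, fall back to the final Q_NX value.
    q_global = float(np.mean(qnx[k_max:])) if k_max < n - 1 else float(qnx[-1])
    return k_max, q_local, q_global


q_diag = 4 * np.eye(3, dtype=np.int64)  # perfect preservation for n = 4 points
curve = q_nx(q_diag)
k_max, q_local, q_global = local_global(q_diag)
```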
cdr_bench.scoring.scoring.calculate_trustworthiness(Q, k)
Calculate the trustworthiness of a dimensionality reduction based on the positions of the nearest neighbors.
Reference: proceedings.mlr.press/v4/lee08a/lee08a.pdf
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `Q` | `ndarray` | A co-ranking matrix. | required |
| `k` | `int` | The number of nearest neighbors to consider. | required |
Returns:
| Type | Description |
|---|---|
| `float` | The trustworthiness score, between 0 and 1. |
cdr_bench.scoring.scoring.calculate_continuity(Q, k)
Calculate the continuity of a dimensionality reduction based on the positions of the farthest neighbors.
Reference: proceedings.mlr.press/v4/lee08a/lee08a.pdf
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `Q` | `ndarray` | A co-ranking matrix. | required |
| `k` | `int` | The number of farthest neighbors to consider. | required |
Returns:
| Type | Description |
|---|---|
| `float` | The continuity score, between 0 and 1. |
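As a rough illustration, the standard rank-based definitions of trustworthiness and continuity can be read directly off a co-ranking matrix. The sketch assumes that rows index high-dimensional ranks and columns index low-dimensional ranks (both starting at 1), that the matrix is the full (n − 1) × (n − 1) one, and that k < n / 2 so the usual normalization applies. The library's conventions may differ, and `trust_and_continuity` is a hypothetical name.

```python
import numpy as np


def trust_and_continuity(q: np.ndarray, k: int) -> tuple[float, float]:
    """Trustworthiness and continuity at neighborhood size k (k < n / 2)."""
    n = q.shape[0] + 1
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    excess = np.arange(1, n)[k:] - k  # (rank - k) for every rank above k
    # Intrusions: neighbors in the embedding (low rank <= k) that were far
    # away in the original space (high rank > k).
    trust = 1.0 - norm * np.sum(excess[:, None] * q[k:, :k])
    # Extrusions: original neighbors (high rank <= k) pushed beyond rank k
    # in the embedding.
    cont = 1.0 - norm * np.sum(excess[None, :] * q[:k, k:])
    return float(trust), float(cont)


q_example = np.array([[2, 2, 0],
                      [2, 2, 0],
                      [0, 0, 4]])  # hand-built co-ranking matrix for n = 4
trust, cont = trust_and_continuity(q_example, k=1)
```

For this hand-built matrix both scores are 0.75: two pairs swap ranks 1 and 2 in each direction, and each swap is penalized by one rank step.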
Aggregate Metrics
cdr_bench.scoring.scoring.calculate_metrics(ambient_dist, latent_dist, k_neighbors, num_samples=None, num_repeats=3)
Calculate various metrics for dimensionality reduction evaluation, with optional sampling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ambient_dist` | `ndarray` | Distance matrix of the high-dimensional data. | required |
| `latent_dist` | `ndarray` | Distance matrix of the low-dimensional (latent) data. | required |
| `k_neighbors` | `List[int]` | List of k values for nearest neighbors. | required |
| `num_samples` | `int` | Number of samples to use for subsampling. If None, no sampling is done. | `None` |
| `num_repeats` | `int` | Number of times to repeat the sampling process. | `3` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary containing the calculated metrics. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the shapes of ambient_dist and latent_dist do not match. |
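The optional sampling can be pictured as follows: draw the same random subset of points from both distance matrices, evaluate a metric on the two sub-matrices, and repeat. The sketch below is an assumption about the general pattern rather than the library's code. The helper names, the mean/std summary, and the example metric are all hypothetical.

```python
import numpy as np


def subsample_pair(ambient_dist, latent_dist, num_samples, rng):
    """Select the same random rows and columns from both distance matrices."""
    if ambient_dist.shape != latent_dist.shape:
        raise ValueError("ambient_dist and latent_dist must have the same shape")
    idx = rng.choice(ambient_dist.shape[0], size=num_samples, replace=False)
    block = np.ix_(idx, idx)
    return ambient_dist[block], latent_dist[block]


def repeated_metric(ambient_dist, latent_dist, metric_fn,
                    num_samples, num_repeats=3, seed=0):
    """Evaluate metric_fn on num_repeats random subsamples and summarize."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(num_repeats):
        sub_high, sub_low = subsample_pair(ambient_dist, latent_dist, num_samples, rng)
        values.append(metric_fn(sub_high, sub_low))
    return {"mean": float(np.mean(values)), "std": float(np.std(values))}


def pearson_of_distances(d_high, d_low):
    """Example metric: Pearson correlation of the upper-triangle distances."""
    iu = np.triu_indices(d_high.shape[0], k=1)
    return float(np.corrcoef(d_high[iu], d_low[iu])[0, 1])


rng = np.random.default_rng(1)
points = rng.normal(size=(40, 5))
ambient = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
latent = 2.0 * ambient  # uniformly rescaled distances keep the correlation at 1
summary = repeated_metric(ambient, latent, pearson_of_distances, num_samples=20)
```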
cdr_bench.scoring.scoring.correlate_distances(distances_high, distances_low, method='spearman')
Calculate correlation between flattened distance matrices.
cdr_bench.scoring.scoring.residual_variance(distances_high, distances_low, method='spearman')
Calculate residual variance (1 - r^2) between flattened distance matrices.
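A small NumPy sketch of both quantities follows. It compares only the upper triangles of the two matrices, which is an assumption since the text above says only "flattened". The Spearman variant uses simple ordinal ranks without tie averaging, and the function names are hypothetical.

```python
import numpy as np


def upper_triangle(dist: np.ndarray) -> np.ndarray:
    """Flatten the strict upper triangle so each pair is counted once."""
    return dist[np.triu_indices(dist.shape[0], k=1)]


def ordinal_ranks(values: np.ndarray) -> np.ndarray:
    """Ranks 0..m-1; ties are broken by position, not averaged."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def distance_correlation(d_high, d_low, method="spearman"):
    a, b = upper_triangle(d_high), upper_triangle(d_low)
    if method == "spearman":
        a, b = ordinal_ranks(a), ordinal_ranks(b)
    elif method != "pearson":
        raise ValueError(f"unsupported method: {method}")
    return float(np.corrcoef(a, b)[0, 1])


def distance_residual_variance(d_high, d_low, method="spearman"):
    r = distance_correlation(d_high, d_low, method)
    return 1.0 - r ** 2


x = np.array([0.0, 1.0, 3.0, 7.0])
d_high = np.abs(x[:, None] - x[None, :])
d_low = d_high ** 2  # monotone distortion: rank order is preserved
rho = distance_correlation(d_high, d_low, "spearman")
resid = distance_residual_variance(d_high, d_low, "spearman")
r_pearson = distance_correlation(d_high, d_low, "pearson")
```

A monotone distortion leaves the rank correlation at 1 and the rank-based residual variance at 0, while the Pearson correlation drops below 1.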
Nearest Neighbors
cdr_bench.scoring.scoring.fit_nearest_neighbors(distance_matrix, k_neighbors)
Fit the NearestNeighbors model and find nearest neighbors indices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `distance_matrix` | `ndarray` | Distance matrix for the dataset. | required |
| `k_neighbors` | `int` | k to use for calculation of nearest neighbors. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[NearestNeighbors, ndarray]` | NearestNeighbors model and neighbors indices. |
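To show what "neighbor indices from a distance matrix" means, here is a NumPy-only stand-in. It returns just the index array and no fitted model, excludes each point from its own neighbor list (whether the library does so is not stated), and uses a hypothetical function name.

```python
import numpy as np


def nearest_from_distances(distance_matrix: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k smallest off-diagonal distances in each row."""
    d = np.array(distance_matrix, dtype=float)
    np.fill_diagonal(d, np.inf)  # exclude self-matches (assumption)
    return np.argsort(d, axis=1, kind="stable")[:, :k]


x = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(x[:, None] - x[None, :])
neighbor_idx = nearest_from_distances(dist, k=2)
```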
cdr_bench.scoring.scoring.prepare_nearest_neighbors(distance_matrix, k_neighbors)
Prepare nearest neighbors and scoring parameters based on the distance matrices.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `distance_matrix` | `ndarray` | Distance matrix for the dataset. | required |
| `k_neighbors` | `int` | k to use for calculation of nearest neighbors. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[NearestNeighbors, Any]` | NearestNeighbors model and scoring parameters. |
cdr_bench.scoring.scoring.calculate_nn_overlap_list(coords, indices_original, k_neighbors, n_components=2)
Calculate the nearest neighbor overlap scores for different k values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `coords` | `ndarray` | Low-dimensional coordinates from dimensionality reduction. | required |
| `indices_original` | `ndarray` | Indices of nearest neighbors in the high-dimensional space. | required |
| `k_neighbors` | `List[int]` | List of k values for nearest neighbors. | required |
| `n_components` | `int` | Number of components in the low-dimensional space. | `2` |
Returns:
| Type | Description |
|---|---|
| `list[float]` | Nearest neighbor overlap scores for each k value. |
Chemical & Network Statistics
cdr_bench.scoring.chemsim_stat.calculate_similarity_statistics(sim_mat)
Calculate statistics on the similarity matrix: min, 1st quartile, median, mean, 3rd quartile, max, and standard deviation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sim_mat` | `ndarray` | A 2D similarity matrix. | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, float]` | Dictionary of similarity metrics. |
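The seven statistics listed above can be computed with NumPy as in the sketch below. It summarizes only the strict upper triangle, so self-similarities and duplicate pairs are excluded, which is an assumption. The dictionary keys and the function name are hypothetical and may not match the library's output.

```python
import numpy as np


def similarity_stats(sim_mat: np.ndarray) -> dict[str, float]:
    """Summary statistics over the unique off-diagonal similarities."""
    vals = sim_mat[np.triu_indices(sim_mat.shape[0], k=1)]
    q1, median, q3 = np.percentile(vals, [25, 50, 75])
    return {
        "min": float(vals.min()),
        "q1": float(q1),
        "median": float(median),
        "mean": float(vals.mean()),
        "q3": float(q3),
        "max": float(vals.max()),
        "std": float(vals.std()),  # population standard deviation
    }


sim = np.array([[1.0, 0.2, 0.4],
                [0.2, 1.0, 0.6],
                [0.4, 0.6, 1.0]])
stats = similarity_stats(sim)
```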
cdr_bench.scoring.scaffold_stat.calculate_scaffold_frequencies_and_f50(scaffolds, save_distribution=False)
Calculate scaffold frequencies and the F50 metric, which is the minimum fraction of unique scaffolds needed to represent 50% of the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `scaffolds` | `List[str]` | List of scaffold SMILES strings. | required |
| `save_distribution` | `bool` | Whether to save the DataFrame with the scaffold distribution. | `False` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, float]` | DataFrame with scaffold frequencies and F50 metric. |
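The F50 definition above can be worked through with the standard library alone: count scaffolds, walk them from most to least frequent, and stop once half the dataset is covered. This sketch returns a plain list of counts rather than a DataFrame, and the function name `scaffold_f50` is hypothetical.

```python
from collections import Counter


def scaffold_f50(scaffolds: list[str]) -> tuple[list[tuple[str, int]], float]:
    """Return (scaffold counts, fraction of unique scaffolds covering 50%)."""
    counts = Counter(scaffolds).most_common()  # most frequent first
    total = len(scaffolds)
    covered = 0
    for n_used, (_, freq) in enumerate(counts, start=1):
        covered += freq
        if covered >= 0.5 * total:
            return counts, n_used / len(counts)
    raise ValueError("scaffolds must not be empty")


scaffolds = ["c1ccccc1"] * 5 + ["C1CCCCC1"] * 3 + ["c1ccncc1", "C1CCNCC1"]
frequencies, f50 = scaffold_f50(scaffolds)
```

Here one scaffold out of four unique ones already covers five of the ten molecules, so F50 is 0.25. A lower F50 means the dataset is dominated by a few scaffolds.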
cdr_bench.scoring.network_stat
build_network_from_similarity(similarity_matrix, cids, threshold)
Build a NetworkX graph from a similarity matrix based on a given similarity threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `similarity_matrix` | `ndarray` | The similarity matrix where each entry represents the similarity between two molecules. | required |
| `cids` | `List[str]` | List of unique identifiers (IDs) for the molecules. | required |
| `threshold` | `float` | Similarity threshold above which an edge is created between nodes. | required |
Returns:
| Type | Description |
|---|---|
| `Graph` | A NetworkX graph where nodes represent molecules (identified by IDs) and edges represent pairwise similarities above the threshold. |
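The thresholding step can be sketched without NetworkX by producing a weighted edge list. The strict `>` comparison follows the word "above" in the description, but whether the library uses `>` or `>=`, and whether it stores the similarity as an edge weight, are assumptions. The function name is hypothetical.

```python
import numpy as np


def edges_above_threshold(similarity_matrix: np.ndarray,
                          cids: list[str],
                          threshold: float) -> list[tuple[str, str, float]]:
    """List (id_a, id_b, similarity) for each unique pair above the threshold."""
    rows, cols = np.triu_indices(len(cids), k=1)  # each unordered pair once
    keep = similarity_matrix[rows, cols] > threshold
    return [
        (cids[i], cids[j], float(similarity_matrix[i, j]))
        for i, j in zip(rows[keep], cols[keep])
    ]


sim = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.5],
                [0.2, 0.5, 1.0]])
edges = edges_above_threshold(sim, ["a", "b", "c"], threshold=0.4)
```

An edge list in this shape could be loaded into a graph with NetworkX's `Graph.add_weighted_edges_from`, after adding all IDs as nodes so that isolated molecules are kept.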
generate_networks_for_thresholds(similarity_matrix, cids, thresholds)
Generate a dictionary of similarity networks for each specified threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `similarity_matrix` | `ndarray` | NumPy array for fingerprint similarity. | required |
| `cids` | `List[str]` | List of compound IDs. | required |
| `thresholds` | `List[float]` | List of similarity thresholds to apply. | required |
Returns:
| Type | Description |
|---|---|
| `dict[float, Graph]` | A dictionary where keys are thresholds and values are the corresponding similarity networks. |
calculate_network_metrics(G, name)
Calculate various network diversity metrics for a given graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `G` | `Graph` | The NetworkX graph for which metrics are calculated. | required |
| `name` | `str` | The name of the network for identification. | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary containing the network name and calculated metrics. |