Scoring & Metrics

Evaluation metrics for dimensionality reduction quality.

DRScorer

cdr_bench.scoring.scoring.DRScorer

A class to score dimensionality reduction models.

Attributes:

Name Type Description
estimator BaseEstimator

Dimensionality reduction model.

scoring_params ScoringParams

Parameters for scoring.

__init__(estimator, scoring_params)

overlap_scoring(X, y=None)

Calculates the overlap percentage between the nearest neighbors in the reduced dimension space and the original high-dimensional space.

Parameters:

Name Type Description Default
X ndarray

Low-dimensional coordinates.

required
y ndarray

Target values (not used in this function).

None

Returns:

Name Type Description
float float

Overlap percentage score.

overlap_scoring_list(X, y=None)

Calculates per-point overlap fraction between nearest neighbors in the reduced dimension space and the original high-dimensional space.

Parameters:

Name Type Description Default
X ndarray

Low-dimensional coordinates.

required
y ndarray

Target values (not used in this function).

None

Returns:

Type Description
list[float]

List[float]: Per-point overlap fractions (0 to 1).
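As a rough illustration of what a per-point neighbor overlap measures, the sketch below compares the k nearest neighbors of each point in two precomputed distance matrices. It is not the DRScorer implementation: the helper names `knn_indices` and `overlap_per_point` are hypothetical, and it takes two distance matrices rather than low-dimensional coordinates plus stored scoring parameters.

```python
def knn_indices(dist, k):
    """For each row of a square distance matrix, return the k nearest other points."""
    neighbors = []
    for i, row in enumerate(dist):
        others = (j for j in range(len(row)) if j != i)  # exclude the point itself
        neighbors.append(sorted(others, key=lambda j: row[j])[:k])
    return neighbors


def overlap_per_point(dist_high, dist_low, k):
    """Fraction (0 to 1) of each point's k nearest neighbors shared by both spaces."""
    nn_high = knn_indices(dist_high, k)
    nn_low = knn_indices(dist_low, k)
    return [len(set(h) & set(l)) / k for h, l in zip(nn_high, nn_low)]


# Four points on a line in the original space (positions 0, 1, 2, 3)...
dist_high = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]
# ...and an embedding that swaps the last two points (positions 0, 1, 3, 2).
dist_low = [
    [0, 1, 3, 2],
    [1, 0, 2, 1],
    [3, 2, 0, 1],
    [2, 1, 1, 0],
]

fractions = overlap_per_point(dist_high, dist_low, k=2)
overlap_percent = 100 * sum(fractions) / len(fractions)
```

Averaging the per-point fractions and scaling by 100 gives a single percentage of the kind overlap_scoring is described as returning; whether the library aggregates in exactly this way is an assumption.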

get_scoring_function(scoring_type)

Returns the appropriate scoring function based on the type.

Parameters:

Name Type Description Default
scoring_type str

The type of scoring function to use.

required

Returns:

Name Type Description
Callable Callable[[ndarray, ndarray | None], float]

A scoring function.

Distance Functions

cdr_bench.scoring.scoring.calculate_distance_matrix(data, metric)

Calculate the distance matrix for the given data using the specified metric.

Parameters:

Name Type Description Default
data ndarray

The input data matrix.

required
metric str

The distance metric to use ('euclidean' or 'tanimoto').

required

Returns:

Type Description
ndarray

np.ndarray: The calculated distance matrix.

Raises:

Type Description
ValueError

If an unsupported distance metric is provided.

cdr_bench.scoring.scoring.calculate_distance_2_matrices(data_1, data_2, metric)

Calculate the pairwise distance matrix between the rows of two data matrices using the specified metric.

Parameters:

Name Type Description Default
data_1 ndarray

The first input data matrix.

required
data_2 ndarray

The second input data matrix.

required
metric str

The distance metric to use ('euclidean' or 'tanimoto').

required

Returns:

Type Description
ndarray

np.ndarray: The calculated distance matrix.

Raises:

Type Description
ValueError

If an unsupported distance metric is provided.

cdr_bench.scoring.scoring.euclidean_distance_square_numba(x1, x2)

Calculate the squared Euclidean distance between each pair of vectors in two arrays using Numba for optimization.
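A plain-Python reference for the quantity described above may help when checking results. The function name `squared_euclidean` is hypothetical, and this sketch makes no claim about how the Numba-compiled version is organized internally. Note that no square root is taken.

```python
def squared_euclidean(x1, x2):
    """Return a len(x1) x len(x2) matrix of squared Euclidean distances between rows."""
    return [
        [sum((a - b) ** 2 for a, b in zip(row1, row2)) for row2 in x2]
        for row1 in x1
    ]


points_a = [[0.0, 0.0], [3.0, 4.0]]
points_b = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]

# Entry (i, j) compares row i of points_a with row j of points_b.
sq = squared_euclidean(points_a, points_b)
```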

cdr_bench.scoring.scoring.tanimoto_int_similarity_matrix_numba(v_a, v_b)

Compute the Tanimoto similarity for integer matrices, comparing each vector in v_a against each vector in v_b.

Parameters:

- v_a (np.ndarray): NumPy matrix where each row represents a vector a.
- v_b (np.ndarray): NumPy matrix where each row represents a vector b.

Returns:

- np.ndarray: Matrix of computed similarity scores, where element (i, j) is the similarity between row i of v_a and row j of v_b.
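One common definition of Tanimoto similarity for integer (count) vectors is the dot-product form T(a, b) = a·b / (a·a + b·b - a·b). The sketch below uses that form as an assumption; the library may use a different variant (for example a min/max form), and both the function name `tanimoto_similarity_matrix` and the choice to return 0.0 for two all-zero vectors are hypothetical.

```python
def tanimoto_similarity_matrix(v_a, v_b):
    """Pairwise Tanimoto similarity (dot-product form) between rows of v_a and v_b."""
    result = []
    for a in v_a:
        aa = sum(x * x for x in a)
        row = []
        for b in v_b:
            bb = sum(y * y for y in b)
            ab = sum(x * y for x, y in zip(a, b))
            denom = aa + bb - ab
            # Assumed convention: two all-zero vectors get similarity 0.0.
            row.append(ab / denom if denom else 0.0)
        result.append(row)
    return result


fps_a = [[1, 0, 2], [0, 1, 0]]
fps_b = [[1, 0, 2], [1, 1, 0]]
sim = tanimoto_similarity_matrix(fps_a, fps_b)
```

Under this definition identical vectors score 1.0 and vectors with no shared features score 0.0. A distance can be derived as 1 minus the similarity.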

Co-ranking Analysis

cdr_bench.scoring.scoring.coranking_matrix(distances_high, distances_low, k=None, use_numba=True)

Compute a co-ranking matrix to compare the preservation of neighborhood relations between two different representations of data.

Parameters:

Name Type Description Default
distances_high ndarray

Distance matrix in the high-dimensional space.

required
distances_low ndarray

Distance matrix in the low-dimensional space.

required
k int

The neighborhood size to consider. Defaults to the number of samples if None.

None
use_numba bool

Flag to use Numba optimized function for large datasets. Defaults to True.

True

Returns:

Type Description
ndarray

np.ndarray: A k x k co-ranking matrix.
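To make the idea concrete, here is a minimal sketch of a co-ranking matrix built from two distance matrices. Entry (k, l) counts the pairs (i, j) for which j is the k-th neighbor of i in the high-dimensional space and the l-th neighbor in the low-dimensional space. The names `rank_matrix` and `coranking` are hypothetical, and the sketch assumes self-ranks are excluded, so it returns an (n - 1) x (n - 1) matrix; the library's indexing and truncation to k may differ.

```python
def rank_matrix(dist):
    """ranks[i][j] = position of j when points are sorted by distance from i (self = 0)."""
    n = len(dist)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort i itself to the front so real neighbors receive ranks 1..n-1.
        order = sorted(range(n), key=lambda j: (j != i, dist[i][j]))
        for position, j in enumerate(order):
            ranks[i][j] = position
    return ranks


def coranking(dist_high, dist_low):
    """Count pairs by (high-dimensional rank, low-dimensional rank)."""
    n = len(dist_high)
    r_high = rank_matrix(dist_high)
    r_low = rank_matrix(dist_low)
    q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[r_high[i][j] - 1][r_low[i][j] - 1] += 1
    return q


dist_high = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
dist_low = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]   # every neighbor order is reversed

q_perfect = coranking(dist_high, dist_high)
q_reversed = coranking(dist_high, dist_low)
```

A perfect embedding puts all counts on the diagonal. Mass off the diagonal indicates neighbors whose rank changed between the two spaces.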

cdr_bench.scoring.scoring.coranking_measures(Q, k_neighbors=None)

Analyze the co-ranking matrix to compute various metrics such as AUC, Qlocal, and Qglobal.
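The measures named above are usually derived from the curve Q_NX(K), the share of K-neighborhoods preserved by the embedding. The sketch below follows one common formulation: LCMC(K) = Q_NX(K) - K/(N - 1), K_max is the K that maximizes LCMC, Qlocal averages Q_NX up to K_max and Qglobal averages it from K_max onward. The function names are hypothetical and the library's exact definitions (including how AUC is weighted) may differ.

```python
def q_nx_curve(q):
    """Q_NX(K) for K = 1..len(q), given an (n-1) x (n-1) co-ranking matrix."""
    n = len(q) + 1
    curve = []
    for k in range(1, len(q) + 1):
        # Sum of the upper-left K x K block, normalized by K * n.
        block = sum(q[a][b] for a in range(k) for b in range(k))
        curve.append(block / (k * n))
    return curve


def local_global(q):
    """Return (K_max, Qlocal, Qglobal) under the assumed definitions."""
    curve = q_nx_curve(q)
    n = len(q) + 1
    lcmc = [qnx - k / (n - 1) for k, qnx in enumerate(curve, start=1)]
    k_max = lcmc.index(max(lcmc)) + 1
    q_local = sum(curve[:k_max]) / k_max
    tail = curve[k_max - 1:]
    q_global = sum(tail) / len(tail)
    return k_max, q_local, q_global


perfect = [[3, 0], [0, 3]]     # all pairs keep their rank
reversed_q = [[0, 3], [3, 0]]  # first and second neighbors swapped

curve_perfect = q_nx_curve(perfect)
summary_reversed = local_global(reversed_q)
```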

cdr_bench.scoring.scoring.calculate_trustworthiness(Q, k)

Calculate the trustworthiness of a dimensionality reduction based on the positions of the nearest neighbors. Reference: https://proceedings.mlr.press/v4/lee08a/lee08a.pdf

Parameters:

- Q (np.ndarray): A co-ranking matrix.
- k (int): The number of nearest neighbors to consider.

Returns:

- float: The trustworthiness score, between 0 and 1.

cdr_bench.scoring.scoring.calculate_continuity(Q, k)

Calculate the continuity of a dimensionality reduction based on the positions of the farthest neighbors. Reference: https://proceedings.mlr.press/v4/lee08a/lee08a.pdf

Parameters:

- Q (np.ndarray): A co-ranking matrix.
- k (int): The number of farthest neighbors to consider.

Returns:

- float: The continuity score, between 0 and 1.
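In the usual rank-based formulation, both scores can be read off a co-ranking matrix. Trustworthiness penalizes points that enter a k-neighborhood in the embedding without belonging to it originally, and continuity penalizes points that drop out of it. The sketch below assumes rows index high-dimensional rank, columns index low-dimensional rank, and 0 < k < n/2, which is where the normalization n·k·(2n - 3k - 1) applies. The function name `trust_and_continuity` is hypothetical, and the library's matrix orientation may differ.

```python
def trust_and_continuity(q, k):
    """Return (trustworthiness, continuity) from an (n-1) x (n-1) co-ranking matrix."""
    size = len(q)
    n = size + 1
    if not 0 < k < n / 2:
        raise ValueError("this normalization assumes 0 < k < n / 2")
    norm = n * k * (2 * n - 3 * k - 1)
    # Intrusions: high-dimensional rank > k but low-dimensional rank <= k.
    intrusion = sum((a + 1 - k) * q[a][b] for a in range(k, size) for b in range(k))
    # Extrusions: high-dimensional rank <= k but low-dimensional rank > k.
    extrusion = sum((b + 1 - k) * q[a][b] for a in range(k) for b in range(k, size))
    return 1 - 2 * intrusion / norm, 1 - 2 * extrusion / norm


# Hand-made 4 x 4 matrix for n = 5 points, for arithmetic illustration only.
q_example = [
    [3, 0, 2, 0],
    [2, 3, 0, 0],
    [0, 2, 3, 0],
    [0, 0, 0, 5],
]
trust, cont = trust_and_continuity(q_example, k=1)

q_diag = [[5 if a == b else 0 for b in range(4)] for a in range(4)]
perfect_scores = trust_and_continuity(q_diag, k=2)
```

For q_example with k = 1 the normalization is 30, the intrusion penalty is 2 and the extrusion penalty is 4, giving 1 - 4/30 and 1 - 8/30 respectively.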

Aggregate Metrics

cdr_bench.scoring.scoring.calculate_metrics(ambient_dist, latent_dist, k_neighbors, num_samples=None, num_repeats=3)

Calculate various metrics for dimensionality reduction evaluation, with optional sampling.

Parameters:

Name Type Description Default
ambient_dist ndarray

Distance matrix of the high-dimensional data.

required
latent_dist ndarray

Distance matrix of the low-dimensional (latent) data.

required
k_neighbors List[int]

List of k values for nearest neighbors.

required
num_samples int

Number of samples to use for subsampling. If None, no sampling is done.

None
num_repeats int

Number of times to repeat the sampling process. Default is 3.

3

Returns:

Type Description
dict[str, Any]

Dict[str, Any]: A dictionary containing the calculated metrics.

Raises:

Type Description
ValueError

If the shapes of ambient_dist and latent_dist do not match.
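The optional sampling can be pictured as follows: draw a random subset of points, slice the same rows and columns out of both distance matrices, evaluate a metric, and repeat. The sketch below is an assumption about that workflow, not the library's code. `repeated_metric`, `take_submatrix`, `mean_abs_diff` and the `seed` argument are hypothetical, and averaging across repeats is one plausible way to combine them.

```python
import random


def take_submatrix(matrix, idx):
    """Keep only the rows and columns listed in idx."""
    return [[matrix[i][j] for j in idx] for i in idx]


def repeated_metric(ambient, latent, metric, num_samples=None, num_repeats=3, seed=0):
    """Evaluate metric on the full matrices, or average it over random subsamples."""
    same_shape = len(ambient) == len(latent) and all(
        len(a) == len(b) for a, b in zip(ambient, latent)
    )
    if not same_shape:
        raise ValueError("ambient and latent distance matrices must have the same shape")
    if num_samples is None:
        return metric(ambient, latent)  # no subsampling requested
    rng = random.Random(seed)
    values = []
    for _ in range(num_repeats):
        idx = sorted(rng.sample(range(len(ambient)), num_samples))
        values.append(metric(take_submatrix(ambient, idx), take_submatrix(latent, idx)))
    return sum(values) / len(values)


def mean_abs_diff(a, b):
    """Toy metric: average absolute difference between two square matrices."""
    n = len(a)
    return sum(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n)) / (n * n)


ambient = [[abs(i - j) for j in range(6)] for i in range(6)]
latent = [[2 * abs(i - j) for j in range(6)] for i in range(6)]

full_score = repeated_metric(ambient, latent, mean_abs_diff)
sampled_score = repeated_metric(ambient, latent, mean_abs_diff, num_samples=4)
```

Using the same index set for both matrices keeps the point correspondence intact, which any neighborhood-based metric depends on.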

cdr_bench.scoring.scoring.correlate_distances(distances_high, distances_low, method='spearman')

Calculate correlation between flattened distance matrices.

cdr_bench.scoring.scoring.residual_variance(distances_high, distances_low, method='spearman')

Calculate residual variance (1 - r^2) between flattened distance matrices.
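The two functions above are closely related: residual variance is one minus the squared correlation. The sketch below illustrates both on the upper triangle of each distance matrix, with Spearman correlation computed as Pearson correlation of average ranks. All names are hypothetical, and flattening only the upper triangle (rather than the full matrix) is an assumption.

```python
def upper_triangle(m):
    """Flatten the strictly upper triangle of a square matrix."""
    return [m[i][j] for i in range(len(m)) for j in range(i + 1, len(m))]


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlate(dist_high, dist_low, method="spearman"):
    x, y = upper_triangle(dist_high), upper_triangle(dist_low)
    if method == "spearman":
        x, y = average_ranks(x), average_ranks(y)
    return pearson(x, y)


def residual_var(dist_high, dist_low, method="spearman"):
    return 1 - correlate(dist_high, dist_low, method) ** 2


dist_high = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
dist_low = [[0, 1, 9], [1, 0, 4], [9, 4, 0]]   # squared distances: same order, nonlinear

rho = correlate(dist_high, dist_low)
rv_pearson = residual_var(dist_high, dist_low, method="pearson")
```

Because the low-dimensional distances here are a monotone transform of the originals, the rank correlation is perfect while the linear (Pearson) fit leaves a small residual.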

Nearest Neighbors

cdr_bench.scoring.scoring.fit_nearest_neighbors(distance_matrix, k_neighbors)

Fit the NearestNeighbors model and find nearest neighbors indices.

Parameters:

Name Type Description Default
distance_matrix ndarray

Distance matrix for the dataset.

required
k_neighbors int

k to use for calculation of nearest neighbors.

required

Returns:

Type Description
tuple[NearestNeighbors, ndarray]

Tuple[NearestNeighbors, np.ndarray]: NearestNeighbors model and neighbors indices.

cdr_bench.scoring.scoring.prepare_nearest_neighbors(distance_matrix, k_neighbors)

Prepare nearest neighbors and scoring parameters based on the distance matrix.

Parameters:

Name Type Description Default
distance_matrix ndarray

Distance matrix for the dataset.

required
k_neighbors int

k to use for calculation of nearest neighbors.

required

Returns:

Type Description
tuple[NearestNeighbors, Any]

Tuple[NearestNeighbors, Any]: NearestNeighbors model and scoring parameters.

cdr_bench.scoring.scoring.calculate_nn_overlap_list(coords, indices_original, k_neighbors, n_components=2)

Calculate the nearest neighbor overlap scores for different k values.

Parameters:

Name Type Description Default
coords ndarray

Low-dimensional coordinates from dimensionality reduction.

required
indices_original ndarray

Indices of nearest neighbors in the high-dimensional space.

required
k_neighbors List[int]

List of k values for nearest neighbors.

required
n_components int

Number of components in the low-dimensional space.

2

Returns:

Type Description
list[float]

List[float]: Nearest neighbor overlap scores for each k value.

Chemical & Network Statistics

cdr_bench.scoring.chemsim_stat.calculate_similarity_statistics(sim_mat)

Calculate statistics on the similarity matrix: min, 1st quartile, median, mean, 3rd quartile, max, and standard deviation.

Parameters:

Name Type Description Default
sim_mat ndarray

A 2D similarity matrix.

required

Returns:

Type Description
dict[str, float]

Dict[str, float]: Dictionary of similarity metrics.
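A small sketch of the seven statistics listed above, using only the standard library. The function name `similarity_summary` and the dictionary keys are hypothetical, and restricting the summary to the strictly upper triangle (so that self-similarities on the diagonal are ignored) is an assumption about what is most useful, not a statement about the library.

```python
import statistics


def similarity_summary(sim_mat):
    """Summarize pairwise similarities from the strictly upper triangle."""
    n = len(sim_mat)
    values = [sim_mat[i][j] for i in range(n) for j in range(i + 1, n)]
    # 'inclusive' treats the data as the full population when interpolating quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
        "std": statistics.pstdev(values),
    }


sim_mat = [
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.9],
    [0.4, 0.9, 1.0],
]
stats = similarity_summary(sim_mat)
```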

cdr_bench.scoring.scaffold_stat.calculate_scaffold_frequencies_and_f50(scaffolds, save_distribution=False)

Calculate scaffold frequencies and the F50 metric, which is the minimum fraction of unique scaffolds needed to represent 50% of the dataset.

Parameters:

Name Type Description Default
scaffolds List[str]

List of scaffold SMILES strings.

required
save_distribution bool

Whether to save the DataFrame containing the scaffold distribution. Defaults to False.

False

Returns:

Type Description
tuple[DataFrame, float]

Tuple[pd.DataFrame, float]: DataFrame with scaffold frequencies and F50 metric.
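The F50 definition above can be worked through by hand: sort scaffolds from most to least frequent, accumulate their counts until half of the dataset is covered, and divide the number of scaffolds used by the number of unique scaffolds. The sketch below does this with `collections.Counter`; the function name `scaffold_f50` is hypothetical and it returns a plain list of counts rather than a DataFrame.

```python
from collections import Counter


def scaffold_f50(scaffolds):
    """Return (frequency list, F50) for a list of scaffold SMILES strings."""
    if not scaffolds:
        raise ValueError("scaffolds must not be empty")
    counts = Counter(scaffolds).most_common()  # most frequent scaffold first
    half = len(scaffolds) / 2
    covered = 0
    for needed, (_, count) in enumerate(counts, start=1):
        covered += count
        if covered >= half:
            break
    return counts, needed / len(counts)


# 10 molecules, 4 unique scaffolds; the most frequent one alone covers 50%.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
frequencies, f50 = scaffold_f50(scaffolds)
```

A low F50 indicates that a few scaffolds dominate the dataset. A value near 0.5 indicates that molecules are spread evenly across scaffolds.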

cdr_bench.scoring.network_stat

build_network_from_similarity(similarity_matrix, cids, threshold)

Build a NetworkX graph from a similarity matrix based on a given similarity threshold.

Parameters:

Name Type Description Default
similarity_matrix ndarray

The similarity matrix where each entry represents the similarity between two molecules.

required
cids List[str]

List of unique identifiers (IDs) for the molecules.

required
threshold float

Similarity threshold above which an edge is created between nodes.

required

Returns:

Type Description
Graph

nx.Graph: A NetworkX graph where nodes represent molecules (identified by IDs) and edges represent pairwise similarities above the threshold.

generate_networks_for_thresholds(similarity_matrix, cids, thresholds)

Generate a dictionary of similarity networks for each specified threshold.

Parameters:

Name Type Description Default
similarity_matrix ndarray

NumPy array for fingerprint similarity.

required
cids List[str]

List of compound ids.

required
thresholds List[float]

List of similarity thresholds to apply.

required

Returns:

Type Description
dict[float, Graph]

Dict[float, nx.Graph]: A dictionary where keys are thresholds and values are the corresponding similarity networks.
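The thresholding logic behind the two functions above can be sketched without any graph library. The example below returns plain adjacency sets instead of NetworkX graphs to stay dependency-free. The function names are hypothetical, and treating "above the threshold" as a strict inequality is an assumption.

```python
def similarity_adjacency(similarity_matrix, cids, threshold):
    """Map each compound ID to the set of IDs whose similarity exceeds threshold."""
    adjacency = {cid: set() for cid in cids}  # isolated compounds stay as empty sets
    for i in range(len(cids)):
        for j in range(i + 1, len(cids)):  # upper triangle only: no self-loops
            if similarity_matrix[i][j] > threshold:
                adjacency[cids[i]].add(cids[j])
                adjacency[cids[j]].add(cids[i])
    return adjacency


def adjacency_for_thresholds(similarity_matrix, cids, thresholds):
    """Build one adjacency mapping per threshold, keyed by the threshold value."""
    return {t: similarity_adjacency(similarity_matrix, cids, t) for t in thresholds}


similarity_matrix = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
cids = ["A", "B", "C"]
networks = adjacency_for_thresholds(similarity_matrix, cids, [0.4, 0.7])
```

Raising the threshold from 0.4 to 0.7 removes the B-C edge and leaves C isolated, which is the kind of change in connectivity that network metrics computed per threshold would capture.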

calculate_network_metrics(G, name)

Calculate various network diversity metrics for a given graph.

Parameters:

Name Type Description Default
G Graph

The NetworkX graph for which metrics are calculated.

required
name str

The name of the network for identification.

required

Returns:

Type Description
dict[str, Any]

Dict[str, Any]: A dictionary containing the network name and calculated metrics.