Scoring & Metrics¶

Evaluation metrics for dimensionality reduction quality.

DRScorer¶

`cdr_bench.scoring.scoring.DRScorer` ¶

A class to score dimensionality reduction models.

Attributes:

Name	Type	Description
`estimator`	`BaseEstimator`	Dimensionality reduction model.
`scoring_params`	`ScoringParams`	Parameters for scoring.

`__init__(estimator, scoring_params)` ¶

`overlap_scoring(X, y=None)` ¶

Calculates the overlap percentage between the nearest neighbors in the reduced dimension space and the original high-dimensional space.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Low-dimensional coordinates.	required
`y`	`ndarray`	Target values (not used in this function).	`None`

Returns:

Name	Type	Description
`float`	`float`	Overlap percentage score.

`overlap_scoring_list(X, y=None)` ¶

Calculates per-point overlap fraction between nearest neighbors in the reduced dimension space and the original high-dimensional space.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Low-dimensional coordinates.	required
`y`	`ndarray`	Target values (not used in this function).	`None`

Returns:

Type	Description
`list[float]`	Per-point overlap fractions (0 to 1).
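As a rough illustration of what the two overlap scores measure, the sketch below computes per-point neighbor overlap fractions and their mean as a percentage with plain NumPy. It is not the `DRScorer` implementation: the helper names (`knn_indices`, `overlap_fractions`), the brute-force Euclidean neighbor search, and passing the high-dimensional data in directly are assumptions made for the example.

```python
import numpy as np


def knn_indices(points: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row (the point itself excluded)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point never counts as its own neighbor
    return np.argsort(dist, axis=1, kind="stable")[:, :k]


def overlap_fractions(high: np.ndarray, low: np.ndarray, k: int) -> list[float]:
    """Per-point fraction (0 to 1) of the k nearest neighbors shared by both spaces."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    return [
        len(set(a.tolist()) & set(b.tolist())) / k
        for a, b in zip(nn_high, nn_low)
    ]


# Toy data: four 3-D points and a 2-D projection that drops the last coordinate.
high = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 5.0, 5.0]])
low = high[:, :2]

per_point = overlap_fractions(high, low, k=2)       # analogous to overlap_scoring_list
overlap_percent = 100.0 * float(np.mean(per_point))  # analogous to overlap_scoring
```

The aggregate score here is simply the mean of the per-point fractions scaled to a percentage; whether the library aggregates in exactly this way is not stated above.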

`get_scoring_function(scoring_type)` ¶

Returns the appropriate scoring function based on the type.

Parameters:

Name	Type	Description	Default
`scoring_type`	`str`	The type of scoring function to use.	required

Returns:

Name	Type	Description
`Callable`	`Callable[[ndarray, ndarray \| None], float]`	A scoring function.

Distance Functions¶

`cdr_bench.scoring.scoring.calculate_distance_matrix(data, metric)` ¶

Calculate the distance matrix for the given data using the specified metric.

Parameters:

Name	Type	Description	Default
`data`	`ndarray`	The input data matrix.	required
`metric`	`str`	The distance metric to use ('euclidean' or 'tanimoto').	required

Returns:

Type	Description
`ndarray`	The calculated distance matrix.

Raises:

Type	Description
`ValueError`	If an unsupported metric is provided.
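The dispatch on `metric` described above might look roughly like the following. This is a sketch under assumptions, not the library code: the function name `distance_matrix_sketch` is hypothetical, the Euclidean branch returns true (not squared) distances, and the Tanimoto distance is taken as one minus the dot-product form of the Tanimoto similarity, with two all-zero rows treated as identical.

```python
import numpy as np


def distance_matrix_sketch(data: np.ndarray, metric: str) -> np.ndarray:
    """Pairwise distances between the rows of `data` (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    dot = data @ data.T
    sq_norms = np.diag(dot)
    if metric == "euclidean":
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped to absorb rounding error
        d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * dot
        return np.sqrt(np.clip(d2, 0.0, None))
    if metric == "tanimoto":
        denom = sq_norms[:, None] + sq_norms[None, :] - dot
        # Assumption: where both rows are all zeros, similarity is set to 1.
        sim = np.divide(dot, denom, out=np.ones_like(dot), where=denom > 0)
        return 1.0 - sim
    raise ValueError(f"Unsupported metric: {metric!r}")


points = np.array([[0.0, 0.0], [3.0, 4.0]])
fingerprints = np.array([[1, 1, 0], [1, 0, 1]])

d_euclidean = distance_matrix_sketch(points, "euclidean")
d_tanimoto = distance_matrix_sketch(fingerprints, "tanimoto")
```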

`cdr_bench.scoring.scoring.calculate_distance_2_matrices(data_1, data_2, metric)` ¶

Calculate the pairwise distance matrix between the rows of two data matrices using the specified metric.

Parameters:

Name	Type	Description	Default
`data_1`	`ndarray`	The first input data matrix.	required
`data_2`	`ndarray`	The second input data matrix.	required
`metric`	`str`	The distance metric to use ('euclidean' or 'tanimoto').	required

Returns:

Type	Description
`ndarray`	The calculated distance matrix.

Raises:

Type	Description
`ValueError`	If an unsupported metric is provided.

`cdr_bench.scoring.scoring.euclidean_distance_square_numba(x1, x2)` ¶

Calculate the squared Euclidean distance between each pair of vectors in two arrays using Numba for optimization.
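For reference, the quantity being computed can be written in a few lines of NumPy broadcasting. The sketch below does not use Numba and the name `squared_euclidean_pairs` is hypothetical; it only shows the expected shape and values of the result.

```python
import numpy as np


def squared_euclidean_pairs(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance between every row of x1 and every row of x2.

    Returns an array of shape (len(x1), len(x2)).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    diff = x1[:, None, :] - x2[None, :, :]  # shape (n1, n2, n_features)
    return (diff ** 2).sum(axis=-1)


pairs = squared_euclidean_pairs([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0]])
```

Broadcasting materializes an `(n1, n2, n_features)` array, so this form is only practical for small inputs; a compiled loop avoids that intermediate.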

`cdr_bench.scoring.scoring.tanimoto_int_similarity_matrix_numba(v_a, v_b)` ¶

Implement the Tanimoto similarity measure for integer matrices, comparing each vector in v_a against each in v_b.

Parameters:

Name	Type	Description	Default
`v_a`	`ndarray`	Matrix in which each row represents a vector a.	required
`v_b`	`ndarray`	Matrix in which each row represents a vector b.	required

Returns:

Type	Description
`ndarray`	Matrix of computed similarity scores, where element (i, j) is the similarity between row i of `v_a` and row j of `v_b`.
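A common definition of Tanimoto similarity for integer (count) vectors is a·b / (|a|² + |b|² − a·b). The sketch below applies that formula to every pair of rows; that the library uses this exact variant is an assumption, as are the function name and the convention that two all-zero rows have similarity 1.

```python
import numpy as np


def tanimoto_similarity_sketch(v_a: np.ndarray, v_b: np.ndarray) -> np.ndarray:
    """Tanimoto similarity between each row of v_a and each row of v_b."""
    a = np.asarray(v_a, dtype=np.int64)
    b = np.asarray(v_b, dtype=np.int64)
    dot = a @ b.T                      # shape (len(a), len(b))
    sq_a = (a * a).sum(axis=1)
    sq_b = (b * b).sum(axis=1)
    denom = sq_a[:, None] + sq_b[None, :] - dot
    # Assumption: pairs of all-zero rows (denom == 0) keep the initial value 1.
    sim = np.ones(dot.shape, dtype=float)
    np.divide(dot, denom, out=sim, where=denom > 0)
    return sim


sim = tanimoto_similarity_sketch([[1, 2, 0], [0, 1, 1]], [[1, 2, 0]])
```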

Co-ranking Analysis¶

`cdr_bench.scoring.scoring.coranking_matrix(distances_high, distances_low, k=None, use_numba=True)` ¶

Compute a co-ranking matrix to compare the preservation of neighborhood relations between two different representations of data.

Parameters:

Name	Type	Description	Default
`distances_high`	`ndarray`	Distance matrix in the high-dimensional space.	required
`distances_low`	`ndarray`	Distance matrix in the low-dimensional space.	required
`k`	`int`	The neighborhood size to consider. Defaults to the number of samples if None.	`None`
`use_numba`	`bool`	Flag to use Numba optimized function for large datasets. Defaults to True.	`True`

Returns:

Type	Description
`ndarray`	A k x k co-ranking matrix.
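To make the construction concrete, here is a minimal NumPy sketch of a full co-ranking matrix. Entry (k, l) counts the pairs (i, j) for which j is the (k+1)-th neighbor of i in the high-dimensional space and the (l+1)-th neighbor in the low-dimensional space. The function names are hypothetical, the row/column orientation is an assumption, and the sketch always returns the full (N-1) x (N-1) matrix rather than honoring `k` or `use_numba`.

```python
import numpy as np


def rank_matrix(distances: np.ndarray) -> np.ndarray:
    """ranks[i, j] = position of j in i's sorted neighbor list (i itself is rank 0)."""
    d = np.asarray(distances, dtype=float).copy()
    np.fill_diagonal(d, -np.inf)  # force each point to rank first in its own row
    order = np.argsort(d, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(d.shape[0])[:, None]
    ranks[rows, order] = np.arange(d.shape[1])[None, :]
    return ranks


def coranking_sketch(distances_high: np.ndarray, distances_low: np.ndarray) -> np.ndarray:
    """Full co-ranking matrix; rows index high-dim rank, columns low-dim rank."""
    n = distances_high.shape[0]
    r_high = rank_matrix(distances_high)
    r_low = rank_matrix(distances_low)
    off_diag = ~np.eye(n, dtype=bool)
    q = np.zeros((n - 1, n - 1), dtype=np.int64)
    # Ranks 1..n-1 map to matrix indices 0..n-2.
    np.add.at(q, (r_high[off_diag] - 1, r_low[off_diag] - 1), 1)
    return q


x = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(x[:, None] - x[None, :])
q_perfect = coranking_sketch(dist, dist)  # identical spaces: all mass on the diagonal
```

When both spaces give identical neighbor rankings, all counts fall on the diagonal; off-diagonal mass indicates neighbors whose rank changed in the embedding.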

`cdr_bench.scoring.scoring.coranking_measures(Q, k_neighbors=None)` ¶

Analyze the co-ranking matrix to compute various metrics such as AUC, Qlocal, and Qglobal.
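The sketch below shows one published convention for these quantities (Lee and Verleysen): Q_NX(K) is the normalized mass of the top-left K x K block, K_max maximizes Q_NX(K) − K/(N−1), Qlocal averages Q_NX over K ≤ K_max, and Qglobal averages it over K ≥ K_max. Whether `coranking_measures` uses exactly these definitions, and how it computes AUC, is not stated here, so treat the function names and formulas as assumptions.

```python
import numpy as np


def qnx_curve(q: np.ndarray) -> np.ndarray:
    """Q_NX(K) for K = 1..N-1, given a full (N-1) x (N-1) co-ranking matrix."""
    n = q.shape[0] + 1
    # Cumulative sums along both axes give the sum of every top-left K x K block.
    block_sums = np.diag(q.cumsum(axis=0).cumsum(axis=1))
    ks = np.arange(1, n)
    return block_sums / (ks * n)


def local_global_sketch(q: np.ndarray) -> dict:
    """Split the Q_NX curve at K_max into local and global averages."""
    n = q.shape[0] + 1
    ks = np.arange(1, n)
    qnx = qnx_curve(q)
    lcmc = qnx - ks / (n - 1)          # gain over a random embedding
    k_max = int(ks[np.argmax(lcmc)])
    return {
        "k_max": k_max,
        "Qlocal": float(qnx[:k_max].mean()),       # K = 1..K_max
        "Qglobal": float(qnx[k_max - 1:].mean()),  # K = K_max..N-1
    }


q_perfect = 4 * np.eye(3, dtype=np.int64)  # co-ranking matrix of a perfect 4-point embedding
summary = local_global_sketch(q_perfect)
```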

`cdr_bench.scoring.scoring.calculate_trustworthiness(Q, k)` ¶

Calculate the trustworthiness of a dimensionality reduction based on the positions of the nearest neighbors. Reference: proceedings.mlr.press/v4/lee08a/lee08a.pdf

Parameters:

Name	Type	Description	Default
`Q`	`ndarray`	A co-ranking matrix.	required
`k`	`int`	The number of nearest neighbors to consider.	required

Returns:

Type	Description
`float`	The trustworthiness score, between 0 and 1.

`cdr_bench.scoring.scoring.calculate_continuity(Q, k)` ¶

Calculate the continuity of a dimensionality reduction based on the positions of the farthest neighbors. Reference: proceedings.mlr.press/v4/lee08a/lee08a.pdf

Parameters:

Name	Type	Description	Default
`Q`	`ndarray`	A co-ranking matrix.	required
`k`	`int`	The number of farthest neighbors to consider.	required

Returns:

Type	Description
`float`	The continuity score, between 0 and 1.
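Both scores can be read off a full co-ranking matrix using the standard rank-based formulas. In the sketch below, trustworthiness penalizes pairs that are within the k nearest neighbors in the low-dimensional space but not in the high-dimensional one, and continuity penalizes the reverse. It assumes rows of `Q` index high-dimensional rank and columns index low-dimensional rank, uses the usual normalization valid for k < N/2, and the function name is hypothetical.

```python
import numpy as np


def trust_and_continuity_sketch(q: np.ndarray, k: int) -> tuple[float, float]:
    """Trustworthiness and continuity from a full (N-1) x (N-1) co-ranking matrix."""
    n = q.shape[0] + 1
    if not 1 <= k < n / 2:
        raise ValueError("this normalization assumes 1 <= k < N/2")
    excess = np.arange(1, n - k)  # (rank - k) for ranks k+1..N-1
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    # Intrusions: high-dim rank > k but low-dim rank <= k (lower-left block).
    intrusions = (excess[:, None] * q[k:, :k]).sum()
    # Extrusions: high-dim rank <= k but low-dim rank > k (upper-right block).
    extrusions = (excess[None, :] * q[:k, k:]).sum()
    return float(1.0 - norm * intrusions), float(1.0 - norm * extrusions)


q_perfect = 5 * np.eye(4, dtype=np.int64)  # perfect embedding of 5 points
trust, cont = trust_and_continuity_sketch(q_perfect, k=2)
```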

Aggregate Metrics¶

`cdr_bench.scoring.scoring.calculate_metrics(ambient_dist, latent_dist, k_neighbors, num_samples=None, num_repeats=3)` ¶

Calculate various metrics for dimensionality reduction evaluation, with optional sampling.

Parameters:

Name	Type	Description	Default
`ambient_dist`	`ndarray`	Distance matrix of the high-dimensional data.	required
`latent_dist`	`ndarray`	Distance matrix of the low-dimensional (latent) data.	required
`k_neighbors`	`List[int]`	List of k values for nearest neighbors.	required
`num_samples`	`int`	Number of samples to use for subsampling. If None, no sampling is done.	`None`
`num_repeats`	`int`	Number of times to repeat the sampling process. Default is 3.	`3`

Returns:

Type	Description
`dict[str, Any]`	A dictionary containing the calculated metrics.

Raises:

Type	Description
`ValueError`	If the shapes of ambient_dist and latent_dist do not match.
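The optional sampling described by `num_samples` and `num_repeats` can be pictured as drawing a random subset of points, slicing the matching square block out of both distance matrices, and averaging a metric over the repeats. The sketch below illustrates only that pattern; `repeated_subsample_metric`, the `metric_fn` callback, the `seed` argument, and the returned keys are hypothetical and not part of the documented API.

```python
import numpy as np


def repeated_subsample_metric(ambient_dist, latent_dist, metric_fn,
                              num_samples=None, num_repeats=3, seed=0):
    """Evaluate metric_fn on random sub-blocks of two aligned distance matrices."""
    if ambient_dist.shape != latent_dist.shape:
        raise ValueError("ambient_dist and latent_dist must have the same shape")
    if num_samples is None:
        # No sampling: evaluate once on the full matrices.
        return {"mean": float(metric_fn(ambient_dist, latent_dist)), "std": 0.0}
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(num_repeats):
        idx = rng.choice(ambient_dist.shape[0], size=num_samples, replace=False)
        block = np.ix_(idx, idx)  # same rows and columns in both matrices
        values.append(metric_fn(ambient_dist[block], latent_dist[block]))
    return {"mean": float(np.mean(values)), "std": float(np.std(values))}


def mean_abs_diff(a, b):
    """Stand-in metric for the example: mean absolute difference of distances."""
    return np.abs(a - b).mean()


x = np.linspace(0.0, 1.0, 10)
ambient = np.abs(x[:, None] - x[None, :])
latent = 0.5 * ambient
result = repeated_subsample_metric(ambient, latent, mean_abs_diff, num_samples=5)
```

Slicing with the same index set on both axes keeps each sub-block a valid distance matrix over the sampled points.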

`cdr_bench.scoring.scoring.correlate_distances(distances_high, distances_low, method='spearman')` ¶

Calculate correlation between flattened distance matrices.

`cdr_bench.scoring.scoring.residual_variance(distances_high, distances_low, method='spearman')` ¶

Calculate residual variance (1 - r^2) between flattened distance matrices.
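A minimal NumPy sketch of both quantities follows. It assumes that only the upper triangle of each symmetric matrix is flattened (the library may flatten the full matrices), implements Spearman correlation as Pearson correlation of tie-averaged ranks, and uses hypothetical function names.

```python
import numpy as np


def average_ranks(values: np.ndarray) -> np.ndarray:
    """1-based ranks, with tied values sharing their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)      # last rank in each group of ties
    starts = ends - counts + 1    # first rank in each group of ties
    return ((starts + ends) / 2.0)[inverse]


def correlate_sketch(distances_high, distances_low, method="spearman") -> float:
    """Correlation between the pairwise distances of two spaces."""
    i, j = np.triu_indices(distances_high.shape[0], k=1)
    a, b = distances_high[i, j], distances_low[i, j]
    if method == "spearman":
        a, b = average_ranks(a), average_ranks(b)
    elif method != "pearson":
        raise ValueError(f"Unsupported method: {method!r}")
    return float(np.corrcoef(a, b)[0, 1])


def residual_variance_sketch(distances_high, distances_low, method="spearman") -> float:
    """1 - r^2, where r is the correlation between the two sets of distances."""
    r = correlate_sketch(distances_high, distances_low, method)
    return 1.0 - r ** 2


x = np.array([0.0, 1.0, 3.0, 7.0])
d_high = np.abs(x[:, None] - x[None, :])
d_low = 2.0 * d_high  # a pure rescaling preserves all distance relationships

r_spearman = correlate_sketch(d_high, d_low)
rv_pearson = residual_variance_sketch(d_high, d_low, method="pearson")
```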

Nearest Neighbors¶

`cdr_bench.scoring.scoring.fit_nearest_neighbors(distance_matrix, k_neighbors)` ¶

Fit the NearestNeighbors model and find nearest neighbors indices.

Parameters:

Name	Type	Description	Default
`distance_matrix`	`ndarray`	Distance matrix for the dataset.	required
`k_neighbors`	`int`	k to use for calculation of nearest neighbors.	required

Returns:

Type	Description
`tuple[NearestNeighbors, ndarray]`	NearestNeighbors model and neighbors indices.
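With scikit-learn, fitting on a precomputed distance matrix can be sketched as follows. This is an illustration rather than the library code: the function name is hypothetical, and whether `fit_nearest_neighbors` excludes each point from its own neighbor list, as this sketch does, is an assumption.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def fit_nn_precomputed(distance_matrix: np.ndarray, k_neighbors: int):
    """Fit NearestNeighbors on a square distance matrix and return neighbor indices."""
    model = NearestNeighbors(n_neighbors=k_neighbors, metric="precomputed")
    model.fit(distance_matrix)
    # Calling kneighbors() with no query uses the training points and
    # leaves each point out of its own neighbor list.
    indices = model.kneighbors(return_distance=False)
    return model, indices


x = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(x[:, None] - x[None, :])
model, indices = fit_nn_precomputed(dist, k_neighbors=2)
```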

`cdr_bench.scoring.scoring.prepare_nearest_neighbors(distance_matrix, k_neighbors)` ¶

Prepare nearest neighbors and scoring parameters based on the distance matrices.

Parameters:

Name	Type	Description	Default
`distance_matrix`	`ndarray`	Distance matrix for the dataset.	required
`k_neighbors`	`int`	k to use for calculation of nearest neighbors.	required

Returns:

Type	Description
`tuple[NearestNeighbors, Any]`	NearestNeighbors model and scoring parameters.

`cdr_bench.scoring.scoring.calculate_nn_overlap_list(coords, indices_original, k_neighbors, n_components=2)` ¶

Calculate the nearest neighbor overlap scores for different k values.

Parameters:

Name	Type	Description	Default
`coords`	`ndarray`	Low-dimensional coordinates from dimensionality reduction.	required
`indices_original`	`ndarray`	Indices of nearest neighbors in the high-dimensional space.	required
`k_neighbors`	`List[int]`	List of k values for nearest neighbors.	required
`n_components`	`int`	Number of components in the low-dimensional space.	`2`

Returns:

Type	Description
`list[float]`	Nearest neighbor overlap scores for each k value.
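The idea of scoring several k values against one set of precomputed high-dimensional neighbors can be sketched as below. The function name, the brute-force neighbor search in the embedding, and reporting the score as a percentage are assumptions; `indices_original` is expected to hold at least max(k) neighbors per point, sorted from nearest to farthest.

```python
import numpy as np


def nn_overlap_for_ks(coords: np.ndarray, indices_original: np.ndarray,
                      k_values: list[int]) -> list[float]:
    """Mean neighbor overlap (in percent) between two spaces, for each k."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    order = np.argsort(dist, axis=1, kind="stable")
    n = coords.shape[0]
    scores = []
    for k in k_values:
        shared = sum(
            len(set(order[i, :k].tolist()) & set(indices_original[i, :k].tolist()))
            for i in range(n)
        )
        scores.append(100.0 * shared / (k * n))
    return scores


coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
# High-dimensional neighbors; the first row lists its two neighbors in swapped order.
indices_original = np.array([[2, 1], [0, 2], [1, 0], [2, 1]])
scores = nn_overlap_for_ks(coords, indices_original, [1, 2])
```

Sorting the low-dimensional neighbors once and slicing the first k columns avoids repeating the neighbor search for every k.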

Chemical & Network Statistics¶

`cdr_bench.scoring.chemsim_stat.calculate_similarity_statistics(sim_mat)` ¶

Calculate statistics on the similarity matrix: min, 1st quartile, median, mean, 3rd quartile, max, and standard deviation.

Parameters:

Name	Type	Description	Default
`sim_mat`	`ndarray`	A 2D similarity matrix.	required

Returns:

Type	Description
`dict[str, float]`	Dictionary of similarity metrics.
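The listed statistics map directly onto NumPy reductions. In the sketch below, the dictionary keys and the function name are hypothetical, and restricting the calculation to the upper triangle (so that self-similarities and duplicate pairs are skipped) is an assumption about how the matrix is summarized.

```python
import numpy as np


def similarity_statistics_sketch(sim_mat: np.ndarray) -> dict[str, float]:
    """Summary statistics over the unique off-diagonal pairs of a similarity matrix."""
    i, j = np.triu_indices(sim_mat.shape[0], k=1)
    values = sim_mat[i, j]
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "min": float(values.min()),
        "q1": float(q1),
        "median": float(median),
        "mean": float(values.mean()),
        "q3": float(q3),
        "max": float(values.max()),
        "std": float(values.std()),
    }


sim_mat = np.array([[1.0, 0.2, 0.4],
                    [0.2, 1.0, 0.6],
                    [0.4, 0.6, 1.0]])
stats = similarity_statistics_sketch(sim_mat)
```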

`cdr_bench.scoring.scaffold_stat.calculate_scaffold_frequencies_and_f50(scaffolds, save_distribution=False)` ¶

Calculate scaffold frequencies and the F50 metric, which is the minimum fraction of unique scaffolds needed to represent 50% of the dataset.

Parameters:

Name	Type	Description	Default
`scaffolds`	`List[str]`	List of scaffold SMILES strings.	required
`save_distribution`	`bool`	Whether to save the DataFrame with the scaffold distribution.	`False`

Returns:

Type	Description
`tuple[DataFrame, float]`	DataFrame with scaffold frequencies and F50 metric.
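The F50 definition above can be worked through with the standard library alone: sort scaffolds by frequency, accumulate counts until half of the dataset is covered, and divide the number of scaffolds used by the number of unique scaffolds. The sketch returns a plain list of counts instead of a DataFrame, and the function name and example SMILES are illustrative.

```python
from collections import Counter


def scaffold_f50_sketch(scaffolds: list[str]) -> tuple[list[tuple[str, int]], float]:
    """Scaffold counts (most frequent first) and the F50 fraction."""
    counts = Counter(scaffolds).most_common()
    half = 0.5 * len(scaffolds)
    covered = 0
    for n_used, (_, count) in enumerate(counts, start=1):
        covered += count
        if covered >= half:
            # n_used scaffolds are enough to cover at least 50% of the molecules.
            return counts, n_used / len(counts)
    raise ValueError("scaffolds must not be empty")


scaffolds = ["c1ccccc1"] * 5 + ["C1CCCCC1"] * 3 + ["c1ccncc1"] * 2
counts, f50 = scaffold_f50_sketch(scaffolds)
```

A low F50 means a few scaffolds dominate the dataset; a value near 0.5 indicates that molecules are spread evenly across scaffolds.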

`cdr_bench.scoring.network_stat` ¶

`build_network_from_similarity(similarity_matrix, cids, threshold)` ¶

Build a NetworkX graph from a similarity matrix based on a given similarity threshold.

Parameters:

Name	Type	Description	Default
`similarity_matrix`	`ndarray`	The similarity matrix where each entry represents the similarity between two molecules.	required
`cids`	`List[str]`	List of unique identifiers (IDs) for the molecules.	required
`threshold`	`float`	Similarity threshold above which an edge is created between nodes.	required

Returns:

Type	Description
`Graph`	A NetworkX graph where nodes represent molecules (identified by IDs) and edges represent pairwise similarities above the threshold.

`generate_networks_for_thresholds(similarity_matrix, cids, thresholds)` ¶

Generate a dictionary of similarity networks for each specified threshold.

Parameters:

Name	Type	Description	Default
`similarity_matrix`	`ndarray`	NumPy array for fingerprint similarity.	required
`cids`	`List[str]`	List of compound ids.	required
`thresholds`	`List[float]`	List of similarity thresholds to apply.	required

Returns:

Type	Description
`dict[float, Graph]`	A dictionary where keys are thresholds and values are the corresponding similarity networks.

`calculate_network_metrics(G, name)` ¶

Calculate various network diversity metrics for a given graph.

Parameters:

Name	Type	Description	Default
`G`	`Graph`	The NetworkX graph for which metrics are calculated.	required
`name`	`str`	The name of the network for identification.	required

Returns:

Type	Description
`dict[str, Any]`	A dictionary containing the network name and calculated metrics.
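The three functions in this module fit together as a small pipeline: threshold the similarity matrix into a graph, repeat for several thresholds, then summarize each graph. The sketch below mirrors that flow with NetworkX. The function names, the `weight` edge attribute, and the particular metrics reported (node and edge counts, density, connected components) are assumptions, since the documentation above does not list which metrics the library computes.

```python
import networkx as nx
import numpy as np


def build_graph_sketch(similarity_matrix: np.ndarray, cids: list[str],
                       threshold: float) -> nx.Graph:
    """Connect two molecules when their similarity is above the threshold."""
    graph = nx.Graph()
    graph.add_nodes_from(cids)  # keep isolated molecules as nodes
    rows, cols = np.triu_indices(len(cids), k=1)
    for i, j in zip(rows, cols):
        sim = float(similarity_matrix[i, j])
        if sim > threshold:
            graph.add_edge(cids[i], cids[j], weight=sim)
    return graph


def basic_metrics_sketch(graph: nx.Graph, name: str) -> dict:
    """A few generic graph statistics, keyed by hypothetical names."""
    return {
        "name": name,
        "n_nodes": graph.number_of_nodes(),
        "n_edges": graph.number_of_edges(),
        "density": nx.density(graph),
        "n_components": nx.number_connected_components(graph),
    }


similarity = np.array([[1.0, 0.9, 0.2],
                       [0.9, 1.0, 0.6],
                       [0.2, 0.6, 1.0]])
cids = ["mol_a", "mol_b", "mol_c"]

networks = {t: build_graph_sketch(similarity, cids, t) for t in [0.5, 0.8]}
metrics = {t: basic_metrics_sketch(g, f"threshold_{t}") for t, g in networks.items()}
```

Raising the threshold removes weaker edges, so the graph typically fragments into more connected components.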
