The metric_comparisons module

The metric_comparisons module contains the MetricComparisons class.

Algorithms for comparing different spaces are implemented as methods of this class.

class metric_comparisons.MetricComparisons(coordinates=None, distances=None, maxk=None, period=None, verbose=False, n_jobs=2)[source]

Class for the metric comparisons.

greedy_feature_selection_full(n_coords, k=1, n_best=10, symm=True)[source]

Greedy selection of the set of features which is most informative about the full distance measure.

Using the n-best single features describing the full feature space, one more of all other features is added combinatorically to make a candidate pool of duplets. Then, using the n-best duplets describing the full space, one more of all other features is added to make a candidate pool of triplets, etc. The procedure is repeated until the desired number of features (n_coords) is reached.

Parameters:

n_coords – number of coordinates after which the algorithm is stopped
k (int) – number of neighbours considered in the computation of the imbalances
n_best (int) – the n_best tuples are chosen in each iteration to combinatorically add one variable and calculate the imbalance until n_coords is reached
symm (bool) – whether to use the symmetrised information imbalance

Returns:

best_tuples (list(list(int))) – best coordinates selected at each iteration
best_imbalances (np.ndarray(float,float)) – imbalances (full->coords, coords->full) computed at each iteration, belonging to the best tuple
all_imbalances (list(list(list(int)))) – all imbalances (full->coords, coords->full) computed at each iteration, belonging to all greedy tuples
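The greedy procedure described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helpers neighbour_ranks, imbalance and greedy_selection are hypothetical names, the imbalance is assumed to be 2/N times the average rank in the second space of the k nearest neighbours in the first space, and candidates are ranked by the mean of the two imbalances as a stand-in for the symmetrised criterion.

```python
def neighbour_ranks(points):
    """ranks[i][j] = position of point j among the neighbours of i (0 = i itself)."""
    n = len(points)
    ranks = []
    for i in range(n):
        d2 = [sum((a - b) ** 2 for a, b in zip(points[i], q)) for q in points]
        # Point i always comes first; ties are broken by index.
        order = sorted(range(n), key=lambda j: (j != i, d2[j]))
        row = [0] * n
        for rank, j in enumerate(order):
            row[j] = rank
        ranks.append(row)
    return ranks


def imbalance(ranks_a, ranks_b, k=1):
    """Assumed definition: 2/N times the mean rank in B of the k nearest neighbours in A."""
    n = len(ranks_a)
    total = sum(ranks_b[i][j] for i in range(n) for j in range(n)
                if 1 <= ranks_a[i][j] <= k)
    return 2.0 * total / (n * n * k)


def greedy_selection(points, n_coords, k=1, n_best=2):
    n_features = len(points[0])
    full_ranks = neighbour_ranks(points)

    def score(coords):
        sub_ranks = neighbour_ranks([[p[c] for c in coords] for p in points])
        # (full -> coords, coords -> full)
        return (imbalance(full_ranks, sub_ranks, k), imbalance(sub_ranks, full_ranks, k))

    candidates = [(c,) for c in range(n_features)]
    best_tuples, best_imbalances, all_imbalances = [], [], []
    for _ in range(n_coords):
        # Rank the candidate pool by the mean of the two imbalances (assumption).
        scored = sorted(((score(c), c) for c in candidates),
                        key=lambda item: sum(item[0]) / 2)
        all_imbalances.append([imbs for imbs, _ in scored])
        best_tuples.append(list(scored[0][1]))
        best_imbalances.append(scored[0][0])
        # Extend each of the n_best tuples by every feature it does not contain yet.
        keep = [c for _, c in scored[:n_best]]
        candidates = sorted({tuple(sorted(t + (c,)))
                             for t in keep for c in range(n_features) if c not in t})
    return best_tuples, best_imbalances, all_imbalances


# Toy data: the third feature is constant, so features 0 and 1 describe the full space.
points = [(0, 0, 0), (1, 3, 0), (2, 1, 0), (3, 4, 0), (4, 2, 0), (5, 5, 0)]
best_tuples, best_imbalances, all_imbalances = greedy_selection(points, n_coords=2)
```

Only the n_best tuples of each size are extended, so the cost grows roughly linearly with n_coords instead of combinatorially.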

greedy_feature_selection_target(target_ranks, n_coords, k, n_best, symm=True)[source]

Greedy selection of the set of features which is most informative about a target distance.

Using the n-best single features describing the target_ranks, one more of all other features is added combinatorically to make a candidate pool of duplets. Then, using the n-best duplets describing the target_ranks, one more of all other features is added to make a candidate pool of triplets, etc. The procedure is repeated until the desired number of variables (n_coords) is reached.

Parameters:

target_ranks (np.ndarray(int)) – an array containing the ranks in the target space, could be e.g. the nearest neighbor ranks for a different set of variables on the same data points.
n_coords – number of coordinates after which the algorithm is stopped
k (int) – number of neighbours considered in the computation of the imbalances
n_best (int) – the n_best tuples are chosen in each iteration to combinatorically add one variable and calculate the imbalance until n_coords is reached
symm (bool) – whether to use the symmetrised information imbalance

Returns:

best_tuples (list(list(int))) – best coordinates selected at each iteration
best_imbalances (np.ndarray(float,float)) – imbalances (full->coords, coords->full) computed at each iteration, belonging to the best tuple
all_imbalances (list(list(list(int)))) – all imbalances (full->coords, coords->full) computed at each iteration, belonging to all greedy tuples

return_data_overlap(coordinates=None, distances=None, dist_indices=None, k=30, avg=True, use_cython=True)[source]

Return the neighbour overlap between the full space and another dataset.

An overlap of 1 means that all neighbours of a point are the same in the two spaces.

Parameters:

coordinates (np.ndarray(float)) – the data set to compare, of shape (N , dimension of embedding space)
distances (np.ndarray(float), tuple(np.ndarray(float), np.ndarray(float))) – Distance matrix (see base class for shape explanation)
k (int) – the number of neighbours considered for the overlap

Returns:

(float) – the neighbour overlap of the points
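As an illustration of what the neighbour overlap measures, the following sketch computes the fraction of shared k nearest neighbours between two representations of the same points. It is a plain-Python approximation under stated assumptions, not the library code: k_nearest and neighbour_overlap are hypothetical helpers, distances are Euclidean, and the result is averaged over all points.

```python
def k_nearest(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours."""
    n = len(points)
    neighbours = []
    for i in range(n):
        d2 = [sum((a - b) ** 2 for a, b in zip(points[i], q)) for q in points]
        order = sorted((j for j in range(n) if j != i), key=lambda j: d2[j])
        neighbours.append(set(order[:k]))
    return neighbours


def neighbour_overlap(points_a, points_b, k):
    """Average fraction of the k nearest neighbours shared by the two spaces."""
    near_a = k_nearest(points_a, k)
    near_b = k_nearest(points_b, k)
    shared = sum(len(a & b) for a, b in zip(near_a, near_b))
    return shared / (k * len(points_a))


# A uniform rescaling preserves every neighbourhood, so the overlap is 1.
space_a = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 5), (5, 7)]
space_b = [(3 * x, 3 * y) for x, y in space_a]
overlap = neighbour_overlap(space_a, space_b, k=2)
```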

return_inf_imb_causality(cause_present, effect_present, effect_future, weights, conditioning_present=None, k=1, period_cause=None, period_effect=None, period_conditioning=None)[source]

Return the imbalances (weight * cause_present, effect_present) -> effect_future.

When conditioning_present is not None, the first space is extended with an additional weight, resulting in (weight1 * cause_present, weight2 * conditioning_present, effect_present) -> effect_future.

Parameters:

cause_present (np.ndarray(float)) – N x D1 matrix, putative driver system data set at time 0
effect_present (np.ndarray(float)) – N x D2 matrix, putative driven system data set at time 0
effect_future (np.ndarray(float)) – N x D2 matrix, putative driven system data set at time tau
weights (list(float), np.ndarray(float)) – scaling parameters for the variables at time 0 (1D array if conditioning_present is None, 2D array of shape (n_weights,2) otherwise, where the first column refers to ‘cause_present’ and the second to ‘conditioning_present’)
conditioning_present (np.ndarray(float)) – N x D3 matrix, conditioning system data set at time 0
k (int) – order of nearest neighbour considered for the calculation of the imbalance
period_cause (int,float,np.ndarray(float)) – periods of variables in ‘cause_present’
period_effect (int,float,np.ndarray(float)) – periods of variables in ‘effect_present’ and ‘effect_future’
period_conditioning (int,float,np.ndarray(float)) – periods of variables in ‘conditioning_present’

Returns:

imbalances (np.ndarray(float)) – the information imbalances for the different weights
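The weight scan can be pictured as follows: for each weight, the present-time space (weight * cause_present, effect_present) is built and its imbalance towards effect_future is computed. The sketch below is an assumption-laden illustration rather than the library's algorithm: the helper names are hypothetical, periodic variables and the conditioning space are ignored, and the imbalance is assumed to be 2/N times the mean rank in the second space of the k nearest neighbours in the first.

```python
def neighbour_ranks(points):
    """ranks[i][j] = position of point j among the neighbours of i (0 = i itself)."""
    n = len(points)
    ranks = []
    for i in range(n):
        d2 = [sum((a - b) ** 2 for a, b in zip(points[i], q)) for q in points]
        order = sorted(range(n), key=lambda j: (j != i, d2[j]))
        row = [0] * n
        for rank, j in enumerate(order):
            row[j] = rank
        ranks.append(row)
    return ranks


def imbalance(ranks_a, ranks_b, k=1):
    """Assumed definition: 2/N times the mean rank in B of the k nearest neighbours in A."""
    n = len(ranks_a)
    total = sum(ranks_b[i][j] for i in range(n) for j in range(n)
                if 1 <= ranks_a[i][j] <= k)
    return 2.0 * total / (n * n * k)


def scan_weights(cause_present, effect_present, effect_future, weights, k=1):
    """Imbalance (w * cause_present, effect_present) -> effect_future for each w."""
    future_ranks = neighbour_ranks(effect_future)
    imbalances = []
    for w in weights:
        present = [[w * c for c in cause] + list(effect)
                   for cause, effect in zip(cause_present, effect_present)]
        imbalances.append(imbalance(neighbour_ranks(present), future_ranks, k))
    return imbalances


# Hypothetical toy system in which x drives y, sampled with time lag tau = 1.
x, y = [0.3], [0.1]
for t in range(60):
    x.append(3.9 * x[t] * (1.0 - x[t]))   # logistic map for the driver
    y.append(0.5 * y[t] + 0.5 * x[t])     # driven variable
cause_present = [[v] for v in x[:-1]]
effect_present = [[v] for v in y[:-1]]
effect_future = [[v] for v in y[1:]]
imbalances = scan_weights(cause_present, effect_present, effect_future,
                          weights=[0.0, 0.5, 1.0])
```

A weight of 0 removes the putative cause from the present-time space, so the first entry serves as the reference value for the rest of the scan.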

return_inf_imb_causality_conditioning(cause_present, effect_present, conditioning_present, effect_future, weights_cause, weights_conditioning, k=1, period_cause=None, period_effect=None, period_conditioning=None)[source]

Return the scanned imbalances in presence and in absence of the putative causal system.

Parameters:

cause_present (np.ndarray(float)) – N x D1 matrix, putative driver system data set at time 0
effect_present (np.ndarray(float)) – N x D2 matrix, putative driven system data set at time 0
conditioning_present (np.ndarray(float)) – N x D3 matrix, conditioning system data set at time 0
effect_future (np.ndarray(float)) – N x D2 matrix, putative driven system data set at time tau
weights_cause (list(float), np.ndarray(float)) – scaling parameters for the causal variables
weights_conditioning (list(float), np.ndarray(float)) – scaling parameters for the conditioning variables
k (int) – order of nearest neighbour considered for the calculation of the imbalance
period_cause (int,float,np.ndarray(float)) – periods of variables in ‘cause_present’
period_effect (int,float,np.ndarray(float)) – periods of variables in ‘effect_present’ and ‘effect_future’
period_conditioning (int,float,np.ndarray(float)) – periods of variables in ‘conditioning_present’

Returns:

imbs_no_cause (np.ndarray(float)) – array of shape (weights_conditioning,) containing the imbalances (weight_conditioning * conditioning_present, effect_present) -> effect_future
imbs_with_cause (np.ndarray(float)) – array of shape (weights_cause * weights_conditioning,) containing the imbalances (weight * cause_present, weight_conditioning * conditioning_present, effect_present) -> effect_future

return_inf_imb_causality_input_rank(ranks_present, effect_future, k=1, period_effect=None)[source]

Return the imbalances (weight * cause_present, effect_present) -> effect_future.

Parameters:

ranks_present (np.ndarray(float)) – array of shape (N_weights, N, maxk+1), containing N_weights matrices (N, maxk+1) corresponding to the scanned values of the scaling parameter
effect_future (np.ndarray(float)) – N x D2 matrix, putative driven system data set at time tau
k (int) – order of nearest neighbour considered for the calculation of the imbalance
period_effect (int,float,np.ndarray(float)) – periods of the variables in ‘effect_future’

Returns:

imbalances (np.ndarray(float)) – the information imbalances for the different weights included in ‘ranks_present’

return_inf_imb_full_all_coords(k=1)[source]

Compute the information imbalances between the ‘full’ space and each one of its D features.

Parameters:

k (int) – number of neighbours considered in the computation of the imbalances

Returns:

(np.array (float)) – a 2xD matrix containing the information imbalances between the original space and each of its D features.

return_inf_imb_full_all_dplets(d, k=1)[source]

Compute the information imbalances between the full space and all possible combinations of d coordinates.

Parameters:

d (int) – target order considered (e.g., d = 2 will compute all couples of coordinates)
k (int) – number of neighbours considered in the computation of the imbalances

Returns:

coord_list – list of the set of coordinates for which imbalances are computed
imbalances – the corresponding couples of information imbalances
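To give a feel for the size of the candidate pool, the d-plets can be enumerated with the standard library. This is only a sketch: all_dplets is a hypothetical helper, and the use of itertools and of 0-based feature indices is an assumption about how the coordinate sets are laid out.

```python
from itertools import combinations
from math import comb


def all_dplets(n_features, d):
    """List every set of d distinct feature indices, in lexicographic order."""
    return [list(c) for c in combinations(range(n_features), d)]


coord_list = all_dplets(n_features=4, d=2)
# coord_list == [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]

# The number of d-plets is the binomial coefficient C(D, d), e.g. 120 for D=10, d=3.
n_triplets = comb(10, 3)
```

Because the count grows combinatorially with D and d, the exhaustive scan is practical only for small d; the greedy selection methods of this class limit the pool to n_best tuples per step.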

return_inf_imb_full_selected_coords(coord_list, k=1)[source]

Compute the information imbalances between the ‘full’ space and a selection of features.

Parameters:

coord_list (list(list(int))) – a list of the type [[1, 2], [8, 3, 5], …] where each sub-list defines a set of coordinates for which the information imbalance should be computed.
k (int) – number of neighbours considered in the computation of the imbalances

Returns:

(np.array (float)) – a 2xL matrix containing the information imbalances between the original space and each one of the L subspaces defined in coord_list

return_inf_imb_matrix_of_coords(k=1)[source]

Compute the information imbalances between all pairs of single features of the data.

Parameters:

k (int) – number of neighbours considered in the computation of the imbalances

Returns:

n_mat (np.array(float)) – a DxD matrix containing all the information imbalances

return_inf_imb_target_all_coords(target_ranks, k=1)[source]

Compute the information imbalances between the ‘target’ space and all single-feature spaces in X.

Parameters:

target_ranks (np.array(int)) – an array containing the ranks in the target space
k (int) – number of neighbours considered in the computation of the imbalances

Returns:

(np.array (float)) – a 2xD matrix containing the information imbalances between the target space and each one of the D single-feature spaces of X

return_inf_imb_target_all_dplets(target_ranks, d, k=1)[source]

Compute the information imbalances between a target distance and all combinations of d coordinates of X.

Parameters:

target_ranks (np.array(int)) – an array containing the ranks in the target space
d (int) – target order considered (e.g., d = 2 will compute all couples of coordinates)
k (int) – number of neighbours considered in the computation of the imbalances

Returns:

coord_list – list of the set of coordinates for which imbalances are computed
imbalances – the corresponding couples of information imbalances

return_inf_imb_target_selected_coords(target_ranks, coord_list, k=1)[source]

Compute the information imbalances between the ‘target’ space and a selection of features.

Parameters:

target_ranks (np.ndarray(int)) – an array containing the ranks in the target space, could be e.g. the nearest neighbor ranks for a different set of variables on the same data points.
coord_list (list(list(int))) – a list of the type [[1, 2], [8, 3, 5], …] where each sub-list defines a set of coordinates for which the information imbalance should be computed.
k (int) – number of neighbours considered in the computation of the imbalances

Returns:

(np.array (float)) – a 2xL matrix containing the information imbalances between the target space and each one of the L subspaces defined in coord_list

return_inf_imb_two_selected_coords(coords1, coords2, k=1)[source]

Return the imbalances between the distances computed on two selected sets of components (coords1 and coords2) of the coordinate matrix X.

Parameters:

coords1 (list(int)) – components for the first distance
coords2 (list(int)) – components for the second distance
k (int) – order of nearest neighbour considered for the calculation of the imbalance, default is 1

Returns:

(float, float) – the information imbalance from the first distance (coords1) to the second distance (coords2) and vice versa

return_information_imbalace(coordinates, k=1, subset_size=2000, repeats=None, avg=True)[source]

Return the imbalance with another dataset X.

Parameters:

coordinates (np.ndarray(float)) – the coordinates of the other dataset (N , dimension of embedding space).
k (int) – order of nearest neighbour considered for the calculation of the imbalance, default is 1.
subset_size (int) – size of the subsets on which the information imbalance is computed.
repeats (int) – the number of repetitions for the information imbalance calculation.

Returns:

(np.array, np.array) – the information imbalances and their standard errors

return_label_overlap(labels, k=None, avg=True, coords=None, class_fraction=None, weighted=True)[source]

Return the neighbour overlap between the full space and a set of labels.

An overlap of 1 means that all neighbours of a point have the same label as the central point.

Parameters:

labels (list) – the labels with respect to which the overlap is computed.
k (int) – the number of neighbours considered for the overlap.
coords (array) – subset of indices on which the overlap is computed.
class_fraction (float) – number of nearest neighbor considered expressed as a fraction of the total number of class samples. Useful when classes are imbalanced.
weighted (bool) – if True the overlap is weighted inversely proportional to the class population.

Returns:

(float) – the neighbour overlap with the class labels.
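The label overlap can be illustrated with a short sketch that counts, for each point, how many of its k nearest neighbours carry the same label. It is a simplified, unweighted version under stated assumptions: label_overlap is a hypothetical helper, distances are Euclidean, and the class_fraction and weighted options of the method are not reproduced.

```python
def label_overlap(points, labels, k):
    """Average fraction of the k nearest neighbours that share the central point's label."""
    n = len(points)
    same = 0
    for i in range(n):
        d2 = [sum((a - b) ** 2 for a, b in zip(points[i], q)) for q in points]
        neighbours = sorted((j for j in range(n) if j != i), key=lambda j: d2[j])[:k]
        same += sum(labels[j] == labels[i] for j in neighbours)
    return same / (n * k)


# Two well-separated clusters, each with its own label.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["a", "a", "a", "b", "b", "b"]
overlap = label_overlap(points, labels, k=2)
```

With labels that follow the clusters every neighbour shares the label of the central point; labels assigned independently of the geometry give a lower value.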

return_label_overlap_coords(labels, coords, k=30)[source]

Return the neighbour overlap between a selection of coordinates and a set of labels.

An overlap of 1 means that all neighbours of a point have the same label as the central point.

Parameters:

labels (np.ndarray) – the labels with respect to which the overlap is computed
coords (list(int)) – a list of coordinates to consider for the distance computation
k (int) – the number of neighbours considered for the overlap

Returns:

(float) – the neighbour overlap of the points

return_label_overlap_selected_coords(labels, coord_list, k=30)[source]

Return a list of neighbour overlaps computed on a list of selected coordinates.

An overlap of 1 means that all neighbours of a point have the same label as the central point.

Parameters:

labels (np.ndarray) – the labels with respect to which the overlap is computed
coord_list (list(list(int))) – a list of lists, with each sublist representing a set of coordinates
k (int) – the number of neighbours considered for the overlap

Returns:

(list (float)) – a list of neighbour overlaps of the points

return_overlap_coords(coords1, coords2, k=30)[source]

Return the neighbour overlap between two subspaces defined by two sets of coordinates.

An overlap of 1 means that in the two subspaces all points have an identical neighbourhood.

Parameters:

coords1 (list(int)) – the list of coordinates defining the first subspace
coords2 (list(int)) – the list of coordinates defining the second subspace
k (int) – the number of neighbours considered for the overlap

Returns:

(float) – the neighbour overlap of the two subspaces

return_ranks_present_for_all_weights(cause_present, effect_present, weights, period_cause=None, period_effect=None)[source]

Return the nearest neighbors’ indices in space (weight*cause_present, effect_present) for all weights.

Parameters:

cause_present (np.ndarray(float)) – N x D1 matrix, putative driver system data set at time 0
effect_present (np.ndarray(float)) – N x D2 matrix, putative driven system data set at time 0
weights (list(float), np.ndarray(float)) – scaling parameters for the driver system at time 0
period_cause (int,float,np.ndarray(float)) – periods of variables in ‘cause_present’
period_effect (int,float,np.ndarray(float)) – periods of variables in ‘effect_present’

Returns:

ranks_present (np.ndarray(float)) – array of shape (N_weights, N, maxk+1), containing N_weights matrices (N, maxk+1) corresponding to the values of the scaling parameters in ‘weights’