The id_estimation module
The id_estimation module contains the IdEstimation class.
The different algorithms of intrinsic dimension estimation are implemented as methods of this class.
- class id_estimation.IdEstimation(*args, **kwargs)[source]
IdEstimation class.
- compute_id_2NN(algorithm='base', mu_fraction=0.9, data_fraction=1, n_iter=None, set_attr=True)[source]
Compute intrinsic dimension using the 2NN algorithm.
- Parameters:
algorithm (str) – ‘base’ to perform the linear fit, ‘ml’ to perform maximum likelihood
mu_fraction (float) – fraction of mus that will be considered for the estimate (discard highest mus)
data_fraction (float) – fraction of randomly sampled points used to compute the id
n_iter (int) – number of times the ID is computed on data subsets (useful when data_fraction < 1)
set_attr (bool) – whether to change the class attributes as a result of the computation
- Returns:
id (float) – the estimated intrinsic dimension
id_err (float) – the standard error on the id estimation
rs (float) – the average nearest neighbor distance (rs)
Quick Start:
from dadapy import Data
from sklearn.datasets import make_swiss_roll

n_samples = 5000
X, _ = make_swiss_roll(n_samples, noise=0.0)
ie = Data(coordinates=X)

results = ie.compute_id_2NN()
# results: (1.96, 0.0, 0.38)  (id, error, average distance to the first two neighbors)
results = ie.compute_id_2NN(mu_fraction=1)
# results: (1.98, 0.0, 0.38)
results = ie.compute_id_2NN(data_fraction=0.25)
# results: (1.99, 0.036, 0.76)
# 1/4 of the points are kept;
# 'id' is the mean over 4 bootstrap samples;
# 'error' is the standard error of the sample mean.
References
E. Facco, M. d’Errico, A. Rodriguez, A. Laio, Estimating the intrinsic dimension of datasets by a minimal neighborhood information, Scientific reports 7 (1) (2017) 1–8
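The estimator admits a compact from-scratch illustration. The sketch below implements only the maximum-likelihood variant (algorithm=’ml’), which inverts the model F(mu) = 1 - mu^(-d) of Facco et al. to give d = N / sum(log mu_i); it uses plain NumPy with a brute-force distance matrix and is an illustration, not dadapy’s implementation.

```python
import numpy as np

def two_nn_id_ml(X):
    """Maximum-likelihood 2NN estimate: d = N / sum(log mu_i),
    where mu_i is the ratio of the 2nd to the 1st NN distance."""
    # brute-force pairwise distances; fine for small N
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    dist.sort(axis=1)
    mu = dist[:, 2] / dist[:, 1]  # column 0 is the point itself
    return len(mu) / np.log(mu).sum()

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))   # intrinsically 2-dimensional data
est = two_nn_id_ml(X)             # close to 2
```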
- compute_id_2NN_wprior(alpha=2, beta=5, posterior_mean=True)[source]
Compute the intrinsic dimension using a Bayesian formulation of the 2NN estimator.
- Parameters:
alpha (float) – parameter of the Gamma prior
beta (float) – parameter of the Gamma prior
posterior_mean (bool) – whether to use the posterior mean as estimator, if False the posterior mode will be used
- Returns:
id (float) – the estimated intrinsic dimension
id_err (float) – the standard error on the id estimation
rs (float) – the average nearest neighbor distance (rs)
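With a Gamma(alpha, beta) prior on d, the 2NN likelihood d^N exp(-d * sum_i log mu_i) is conjugate, so the posterior is Gamma(alpha + N, beta + sum_i log mu_i) and both the posterior mean and mode are available in closed form. A minimal sketch of that update, assuming this likelihood; the mus here are synthetic and `two_nn_id_wprior` is a hypothetical name, not dadapy’s API:

```python
import numpy as np

def two_nn_id_wprior(mus, alpha=2.0, beta=5.0, posterior_mean=True):
    # conjugate update: prior Gamma(alpha, beta), likelihood
    # d^N * exp(-d * S) with S = sum(log mu) -> posterior Gamma(alpha + N, beta + S)
    n, s = len(mus), np.log(mus).sum()
    a_post, b_post = alpha + n, beta + s
    # posterior mean vs posterior mode of a Gamma(a, b) density
    return a_post / b_post if posterior_mean else (a_post - 1) / b_post

# synthetic mus drawn from the 2NN model with true d = 2:
# log(mu) is exponentially distributed with rate d
rng = np.random.default_rng(1)
mus = np.exp(rng.exponential(scale=1 / 2.0, size=5000))
est = two_nn_id_wprior(mus)   # close to 2
```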
- compute_id_binomial_k(k, r, bayes=True, plot_mv=False, plot_posterior=False, k_bootstrap=1)[source]
Calculate id using the binomial estimator by fixing the number of neighbours.
As in the fixed-rk variant, the central point is removed from the counts n and k. In addition, the k-th NN itself must be removed, since its distance plays the role of the external radius rk. For example, with k=5 the 5th NN defines the radius, and 6 points (including the central one) lie within it, i.e. k_eff = 6, from which 2 points are excluded. For this reason the MLE is computed directly with k-1, which is exactly k_eff-2.
- Parameters:
k (int) – number of neighbours to take into account
r (float) – ratio between internal and external shells
bayes (bool, default=True) – choose method between bayes (True) and mle (False). The bayesian estimate gives the mean value and std of d, while mle returns the max of the likelihood and the std according to Cramer-Rao lower bound
plot_mv (bool, default=False) – whether to print the output of the model validation
plot_posterior (bool, default=False) – if True, together with bayes, plots the posterior of the ID
- Returns:
id (float) – the estimated intrinsic dimension
id_err (float) – the standard error on the id estimation
scale (float) – the average nearest neighbor distance (rs)
pv (float) – p-value of the test statistics through Epps-Singleton test
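The idea behind the fixed-k estimator can be sketched in a few lines: take each point’s k-th NN distance as the external radius, count how many of the remaining k-1 neighbours fall inside the internal shell of radius r*rk, and invert the success probability r^d. The sketch below (hypothetical name `binomial_id`, plain NumPy, point estimate only, without the Bayesian posterior or the Epps-Singleton validation) is an illustration, not dadapy’s implementation:

```python
import numpy as np

def binomial_id(X, k=20, r=0.5):
    # brute-force distance matrix; fine for small N
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    dist.sort(axis=1)
    rk = dist[:, k]  # k-th NN distance (column 0 is the point itself)
    # neighbours of rank 1..k-1 inside the internal shell r * rk;
    # the central point and the k-th NN are excluded, leaving k - 1 trials
    inner = (dist[:, 1:k] <= r * rk[:, None]).sum(axis=1)
    p_hat = inner.mean() / (k - 1)   # estimate of r**d
    return np.log(p_hat) / np.log(r)

rng = np.random.default_rng(2)
X = rng.uniform(size=(1500, 2))      # intrinsically 2-dimensional
est = binomial_id(X, k=20, r=0.5)    # close to 2
```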
- compute_id_binomial_rk(rk, r, bayes=True, plot_mv=False, plot_posterior=False)[source]
Calculate the id using the binomial estimator by fixing the same external radius for all the points.
In the estimation of the id, the central point must be removed from the counts n and k, as it is not part of the Poisson process generating its neighbourhood.
- Parameters:
rk (float or np.ndarray(float)) – radius of the external shell
r (float) – ratio between internal and external shell
bayes (bool, default=True) – choose method between bayes (True) and mle (False). The bayesian estimate gives the mean value and std of d, while mle returns the max of the likelihood and the std according to Cramer-Rao lower bound
plot_mv (bool, default=False) – whether to print the output of the model validation
plot_posterior (bool, default=False) – if True, together with bayes, plots the posterior of the ID
- Returns:
id (float) – the estimated intrinsic dimension
id_err (float) – the standard error on the id estimation
scale (float) – scale at which the id is performed
pv (float) – p-value of the test statistics computed with Epps-Singleton model validation
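The fixed-radius variant can be sketched analogously: count the neighbours of each point inside the external shell rk and inside the internal shell r*rk (excluding the central point itself), and invert the pooled ratio n/k ~ r^d. Again an illustration with a hypothetical name, not dadapy’s implementation:

```python
import numpy as np

def binomial_rk_id(X, rk=0.2, r=0.5):
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)   # the central point is not part of its own neighbourhood
    k = (dist <= rk).sum(axis=1)     # neighbours inside the external shell
    n = (dist <= r * rk).sum(axis=1) # neighbours inside the internal shell
    p_hat = n.sum() / k.sum()        # pooled estimate of r**d
    return np.log(p_hat) / np.log(r)

rng = np.random.default_rng(3)
X = rng.uniform(size=(1000, 2))          # intrinsically 2-dimensional
est = binomial_rk_id(X, rk=0.2, r=0.5)   # close to 2
```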
- return_id_scaling_2NN(n_min=10, algorithm='base', mu_fraction=0.9, set_attr=False, return_sizes=False)[source]
Compute the id with the 2NN algorithm at different scales.
The different scales are obtained by sampling subsets of [N, N/2, N/4, N/8, …, n_min] data points.
- Parameters:
n_min (int) – minimum number of points considered when decimating the dataset, n_min effectively sets the largest ‘scale’;
algorithm (str) – ‘base’ to perform the linear fit, ‘ml’ to perform maximum likelihood;
mu_fraction (float) – fraction of mus that will be considered for the estimate (discard highest mus).
- Returns:
ids_scaling (np.ndarray(float)) – array of intrinsic dimensions;
ids_scaling_err (np.ndarray(float)) – array of error estimates;
scales (np.ndarray(int)) – array of maximum nearest neighbor rank included in the estimate
Quick Start:
from dadapy import Data
from sklearn.datasets import make_swiss_roll

# two-dimensional curved manifold embedded in 3d with noise
n_samples = 5000
X, _ = make_swiss_roll(n_samples, noise=0.3)
ie = Data(coordinates=X)

ids_scaling, ids_scaling_err, scales = ie.return_id_scaling_2NN(n_min=20)
# ids_scaling:     array([2.88, 2.77, 2.65, 2.42, 2.22, 2.2, 2.1, 2.23])
# ids_scaling_err: array([0., 0.02, 0.05, 0.04, 0.04, 0.03, 0.04, 0.04])
# scales:          array([2, 4, 8, 16, 32, 64, 128, 256])
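The decimation scheme can be reproduced from scratch: split the data into disjoint subsets of size N, N/2, N/4, …, down to n_min, estimate the id on each subset, and report the mean and standard error per scale. A sketch assuming a plain maximum-likelihood 2NN estimate at each scale (dadapy’s linear-fit default will differ slightly; all names here are illustrative):

```python
import numpy as np

def two_nn_ml(X):
    # maximum-likelihood 2NN estimate on a single subset
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    dist.sort(axis=1)
    mu = dist[:, 2] / dist[:, 1]
    return len(mu) / np.log(mu).sum()

def id_scaling_2nn(X, n_min=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    ids, errs = [], []
    n = N
    while n >= n_min:
        perm = rng.permutation(N)
        # disjoint subsets of size n; id averaged over them
        est = [two_nn_ml(X[perm[i:i + n]]) for i in range(0, N - n + 1, n)]
        ids.append(np.mean(est))
        errs.append(np.std(est) / max(len(est) - 1, 1) ** 0.5)
        n //= 2
    return np.array(ids), np.array(errs)

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
ids, errs = id_scaling_2nn(X, n_min=100)   # one id per scale, each close to 2
```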
- return_id_scaling_gride(range_max=64, d0=0.001, d1=1000, eps=1e-07, set_attr=False, return_ranks=False)[source]
Compute the id at different scales using the Gride algorithm.
- Parameters:
range_max (int) – maximum nearest neighbor rank considered for the id computations; the number of id estimates is log2(range_max), as the nearest neighbor order (‘scale’) is doubled at each estimate;
d0 (float) – minimum intrinsic dimension considered in the search;
d1 (float) – maximum intrinsic dimension considered in the search;
eps (float) – precision of the approximate id calculation.
set_attr (bool) – whether to change the class attributes as a result of the computation
- Returns:
ids_scaling (np.ndarray(float)) – array of intrinsic dimensions of length log2(range_max);
ids_scaling_err (np.ndarray(float)) – array of error estimates;
rs_scaling (np.ndarray(float)) – array of average distances of the neighbors involved in the estimates.
Quick Start:
from dadapy import Data
from sklearn.datasets import make_swiss_roll

# two-dimensional curved manifold embedded in 3d with noise
n_samples = 5000
X, _ = make_swiss_roll(n_samples, noise=0.3)
ie = Data(coordinates=X)

ids_scaling, ids_scaling_err, rs_scaling = ie.return_id_scaling_gride(range_max=512)
# ids_scaling:     array([2.81, 2.71, 2.48, 2.27, 2.11, 1.98, 1.95, 2.05])
# ids_scaling_err: array([0.04, 0.03, 0.02, 0.01, 0.01, 0.01, 0., 0.])
# rs_scaling:      array([0.52, 0.69, 0.93, 1.26, 1.75, 2.48, 3.54, 4.99])
References
F. Denti, D. Doimo, A. Laio, A. Mira, Distributional results for model-based intrinsic dimension estimators, arXiv preprint arXiv:2104.13832 (2021).