The id_estimation module

The id_estimation module contains the IdEstimation class.

The different algorithms of intrinsic dimension estimation are implemented as methods of this class.

class id_estimation.IdEstimation(*args, **kwargs)[source]

IdEstimation class.

compute_id_2NN(algorithm='base', mu_fraction=0.9, data_fraction=1, n_iter=None, set_attr=True)[source]

Compute intrinsic dimension using the 2NN algorithm.

Parameters:
  • algorithm (str) – ‘base’ to perform the linear fit, ‘ml’ to perform maximum likelihood

  • mu_fraction (float) – fraction of mus that will be considered for the estimate (discard highest mus)

  • data_fraction (float) – fraction of randomly sampled points used to compute the id

  • n_iter (int) – number of times the ID is computed on data subsets (useful when data_fraction < 1)

  • set_attr (bool) – whether to change the class attributes as a result of the computation

Returns:
  • id (float) – the estimated intrinsic dimension

  • id_err (float) – the standard error on the id estimation

  • rs (float) – the average nearest neighbor distance (rs)

Quick Start:

from dadapy import Data
from sklearn.datasets import make_swiss_roll

n_samples = 5000
X, _ = make_swiss_roll(n_samples, noise=0.0)

ie = Data(coordinates=X)

results = ie.compute_id_2NN()
results:
(1.96, 0.0, 0.38)       # (id, error, average distance to the first two neighbors)

results = ie.compute_id_2NN(data_fraction=1)
results:
(1.98, 0.0, 0.38)       # (id, error, average distance to the first two neighbors)

results = ie.compute_id_2NN(data_fraction=0.25)
results:
(1.99, 0.036, 0.76)     # (id, error, average distance to the first two neighbors)
                        # 1/4 of the points are kept.
                        # 'id' is the mean over 4 bootstrap samples;
                        # 'error' is the standard error of the sample mean.

References

E. Facco, M. d’Errico, A. Rodriguez, A. Laio, Estimating the intrinsic dimension of datasets by a minimal neighborhood information, Scientific reports 7 (1) (2017) 1–8

compute_id_2NN_wprior(alpha=2, beta=5, posterior_mean=True)[source]

Compute the intrinsic dimension using a Bayesian formulation of the 2NN estimator.

Parameters:
  • alpha (float) – parameter of the Gamma prior

  • beta (float) – parameter of the Gamma prior

  • posterior_mean (bool) – whether to use the posterior mean as estimator, if False the posterior mode will be used

Returns:
  • id (float) – the estimated intrinsic dimension

  • id_err (float) – the standard error on the id estimation

  • rs (float) – the average nearest neighbor distance (rs)

compute_id_binomial_k(k, r, bayes=True)[source]

Calculate the id using the binomial estimator by fixing the number of neighbours.

As in the case in which one fixes rk, in this version of the estimator the central point is removed from the counts n and k. In addition, the k-th NN itself must be removed, since it plays the role of the distance at which rk is taken. For example, if k=5 the 5th NN of the central point is considered, so that 6 points are involved in total (including the central one); the effective count is thus k_eff = 6, from which 2 must be subtracted. For this reason the MLE computation uses k-1 directly, which corresponds explicitly to k_eff-2.

Parameters:
  • k (int) – number of neighbours to take into account

  • r (float) – ratio between internal and external shells

  • bayes (bool, default=True) – choose between the Bayesian method (True) and MLE (False). The Bayesian estimate gives the posterior mean and standard deviation of d, while MLE returns the maximum of the likelihood and the standard deviation according to the Cramér-Rao lower bound

Returns:
  • id (float) – the estimated intrinsic dimension

  • id_err (float) – the standard error on the id estimation

  • rs (float) – the average nearest neighbor distance (rs)

compute_id_binomial_rk(rk, r, bayes=True)[source]

Calculate the id using the binomial estimator by fixing the same external radius for all the points.

In the estimation of the id, the central point must be removed from the counts n and k, as it is not effectively part of the Poisson process generating its neighbourhood.

Parameters:
  • rk (float) – radius of the external shell

  • r (float) – ratio between internal and external shell

  • bayes (bool, default=True) – choose between the Bayesian method (True) and MLE (False). The Bayesian estimate gives the posterior mean and standard deviation of d, while MLE returns the maximum of the likelihood and the standard deviation according to the Cramér-Rao lower bound

Returns:
  • id (float) – the estimated intrinsic dimension

  • id_err (float) – the standard error on the id estimation

  • rs (float) – the average nearest neighbor distance (rs)

return_id_scaling_2NN(n_min=10, algorithm='base', mu_fraction=0.9, set_attr=False, return_sizes=False)[source]

Compute the id with the 2NN algorithm at different scales.

The different scales are obtained by sampling subsets of [N, N/2, N/4, N/8, …, n_min] data points.

Parameters:
  • n_min (int) – minimum number of points considered when decimating the dataset, n_min effectively sets the largest ‘scale’;

  • algorithm (str) – ‘base’ to perform the linear fit, ‘ml’ to perform maximum likelihood;

  • mu_fraction (float) – fraction of mus that will be considered for the estimate (discard highest mus);

  • set_attr (bool) – whether to change the class attributes as a result of the computation;

  • return_sizes (bool) – whether to also return the sizes of the sampled subsets.

Returns:
  • ids_scaling (np.ndarray(float)) – array of intrinsic dimensions;

  • ids_scaling_err (np.ndarray(float)) – array of error estimates;

  • scales (np.ndarray(int)) – array of maximum nearest neighbor rank included in the estimate

Quick Start:

from dadapy import Data
from sklearn.datasets import make_swiss_roll

#two dimensional curved manifold embedded in 3d with noise

n_samples = 5000
X, _ = make_swiss_roll(n_samples, noise=0.3)

ie = Data(coordinates=X)
ids_scaling, ids_scaling_err, scales = ie.return_id_scaling_2NN(n_min=20)

ids_scaling:
array([2.88, 2.77, 2.65, 2.42, 2.22, 2.2 , 2.1 , 2.23])

ids_scaling_err:
array([0.  , 0.02, 0.05, 0.04, 0.04, 0.03, 0.04, 0.04])

scales:
array([  2,   4,   8,  16,  32,  64, 128, 256])

return_id_scaling_gride(range_max=64, d0=0.001, d1=1000, eps=1e-07, set_attr=False, return_ranks=False)[source]

Compute the id at different scales using the Gride algorithm.

Parameters:
  • range_max (int) – maximum nearest neighbor rank considered for the id computations; the number of id estimates are log2(range_max) as the nearest neighbor order (‘scale’) is doubled at each estimate;

  • d0 (float) – minimum intrinsic dimension considered in the search;

  • d1 (float) – maximum intrinsic dimension considered in the search;

  • eps (float) – precision of the approximate id calculation.

  • set_attr (bool) – whether to change the class attributes as a result of the computation;

  • return_ranks (bool) – whether to also return the nearest neighbor ranks used at each scale.

Returns:
  • ids_scaling (np.ndarray(float)) – array of intrinsic dimensions of length log2(range_max);

  • ids_scaling_err (np.ndarray(float)) – array of error estimates;

  • rs_scaling (np.ndarray(float)) – array of average distances of the neighbors involved in the estimates.

Quick Start:

from dadapy import Data
from sklearn.datasets import make_swiss_roll

#two dimensional curved manifold embedded in 3d with noise

n_samples = 5000
X, _ = make_swiss_roll(n_samples, noise=0.3)

ie = Data(coordinates=X)
ids_scaling, ids_scaling_err, rs_scaling = ie.return_id_scaling_gride(range_max = 512)

ids_scaling:
array([2.81, 2.71, 2.48, 2.27, 2.11, 1.98, 1.95, 2.05])

ids_scaling_err:
array([0.04, 0.03, 0.02, 0.01, 0.01, 0.01, 0.  , 0.  ])

rs_scaling:
array([0.52, 0.69, 0.93, 1.26, 1.75, 2.48, 3.54, 4.99])

References

F. Denti, D. Doimo, A. Laio, A. Mira, Distributional results for model-based intrinsic dimension estimators, arXiv preprint arXiv:2104.13832 (2021).

set_id(d)[source]

Set the intrinsic dimension.