The id_discrete module
The id_discrete module contains the IdDiscrete class.
- The different algorithms of intrinsic dimension estimation for discrete spaces
are implemented as methods of this class.
- class id_discrete.IdDiscrete(coordinates=None, distances=None, is_network=False, maxk=None, condensed=None, weights=None, verbose=False, njobs=2)[source]
Estimates the intrinsic dimension of a dataset with discrete features using a binomial likelihood.
Inherits from class Base.
- lk
radius of the external shells
- Type:
int, np.ndarray(int)
- ln
radius of the internal shells
- Type:
int, np.ndarray(int)
- k
total number of points within the external shell
- Type:
np.ndarray(int)
- n
total number of points within the internal shell
- Type:
np.ndarray(int)
- ratio
ratio between internal and external radii
- Type:
float
- intrinsic_dim
intrinsic dimension obtained with the binomial estimator
- Type:
float
- intrinsic_dim_err
error associated with the id estimation. Computed through Cramer-Rao or Bayesian inference
- Type:
float
- intrinsic_dim_scale
scale at which the id has been computed
- Type:
float
- posterior_domain
eventual support of the posterior distribution of the id
- Type:
np.ndarray(float)
- posterior
posterior distribution when evaluated with Bayesian inference
- Type:
np.ndarray(float)
- K_mod_val(R_win, subset=None, path=None, pdf=False, cdf=True, recap=False)[source]
Use Kolmogorov-Smirnoff test to assess the goodness of the estimate at fixed R within certain windows in k
For a fixed value of K and for given values of windows of R, compute artificial estimates for the n and compare them to the empirical ones using the KS test
- Parameters:
R_win (np.ndarray(int)) – boundaries of the windows in R
subset (int or np.ndarray(int)) – subset of points used to perform the model validation
path (string) – if provided, save the plots and observables to the provided folder
pdf (bool) – plot empirical vs model pdfs
cdf (bool) – plot empirical vs model cdfs
recap (bool) – plot id and p-values for each window
- R_mod_val(k_win, subset=None, path=None, pdf=False, cdf=True, recap=False)[source]
Use Kolmogorov-Smirnoff test to assess the goodness of the estimate at fixed R within certain windows in k
For a fixed value of R and for given values of windows of k, compute artificial estimates for the n and compare them to the empirical ones using the KS test
- Parameters:
k_win (np.ndarray(int)) – boundaries of the windows in k
subset (int or np.ndarray(int)) – subset of points used to perform the model validation
path (string) – if provided, save the plots and observables to the provided folder
pdf (bool) – plot empirical vs model pdfs
cdf (bool) – plot empirical vs model cdfs
recap (bool) – plot id and p-values for each window
- compute_distances(maxk=None, metric='manhattan', period=None, condensed=None, d_max=100)[source]
Compute distances between datapoints. Distances can be saved in an extended or compacted shape.
If condensed is True, self.distances will store how many points are present up to that distance. Conversely, the usual scheme is adopted (see same method in Base)
- Parameters:
maxk – maximum number of neighbours for which distance is computed and stored
metric – type of metric
period (float or np.array) – periodicity (only used for periodic distance computation). Default is None.
condensed (bool) – save how many points one finds at each distance instead of all distances
d_max (int, default=100) – decide the farthest distance to be saved (only if condensed is True)
- compute_id_binomial_k_discrete(k, ratio=0.5, shell=False, subset=None, approx_err=True, set_attr=True)[source]
Calculate Id using the binomial estimator by fixing the number of neighbours or shells
As in the case in which one fix lk, also in this version of the estimation one removes the central point from n and k. Two different ways of computing k,n,lk,ln are available, whether one intends k as the k-th neighbour or the k-th shell. In both cases, one wants to be sure that the shell chosen is
- Parameters:
k (int) – order of neighbour that set the external shell
ratio (float) – ratio between internal and external shell
shell (bool) – k stands for number of neighbours or number of occupied shells
subset (int) – choose a random subset of the dataset to make the Id estimate
approx_err (bool, default=True) – if True, computes the error on the estimate using the CR. Otherwise it profiles the likelihood and compute its std (takes longer)
set_attr (bool, default=True) – assign id, id error and scale to the class
- Returns:
intrinsic_dim (float) – the id estimation
intrinsic_dim_err (float) – the error estimate on the id
intrinsic_dim_scale (float) – the scale at which the id was computed: <lk>
- compute_id_binomial_lk(lk=None, ln=None, method='bayes', subset=None, plot=True, set_attr=True)[source]
Calculate Id using the binomial estimator by fixing the eternal radius for all the points
In the estimation of d one has to remove the central point from the counting of n and k as it generates the process, but it is not effectively part of it
- Parameters:
lk (int) – radius of the external shell
ln (int) – radius of the internal shell
subset (int) – choose a random subset of the dataset to make the Id estimate
method (str, default 'bayes') – choose method between ‘bayes’ and ‘mle’. The bayesian estimate gives the distribution of the id, while mle only the max of the likelihood
plot (bool) – if bayes method is used, one can decide whether to plot the posterior
set_attr (bool) – changes class attributes after computation
- Returns:
intrinsic_dim (float) – the id esteem
intrinsic_dim_err (float) – the error estimate on the id
intrinsic_dim_scale (float) – the scale at which the id was computed (lk)
- compute_local_density(plot=True)[source]
Compute and compare the rescaled local density of points
Compute and compare the local density of points inside inner and outer shells after fixing lk and ln as n/<n> and (k-n)/<k-n>. The closest the points to the y=x line, the more the local uniform density hypothesis is respected, the better will be the id estimate
- Parameters:
plot (bool, default=True) – decide whether to plot the local density
- Returns:
n (np.ndarray(float)) – n/<n>
m (np.ndarray(float)) – (k-n)/<k-n>
- fix_k(k_eff=None, ratio=0.5)[source]
Computes Rk, Rn, n for each point given a selected value of k
This routine computes external radii lk, internal radii ln and internal points n. It also ensures that lk is scaled onto the most external shell for which we are sure to have all points. A mask will exclude “badly behaved” points if condensed is False. NB As a consequence the k will be point dependent
- Parameters:
k_eff (int, default=self.k_max) – selected (max) number of NN
ratio (float, default = 0.5) – approximate ratio among ln and lk
Returns:
- fix_k_shell(k_shell, ratio=0.5)[source]
Computes the lk, ln, n given k_shell
This routine computes the external radius lk, the associated points k, internal radius ln and associated points n. The computation is performed starting from a given number of shells. It ensures that Rk is scaled onto the most external shell for which we are sure to have all points. NB in this case we will have an effective, point dependent k
- Parameters:
k_shell (int) – selected (max) number of considered shells
ratio (float) – ratio among Rn and Rk
- fix_lk(lk=None, ln=None)[source]
Computes the k points within the given Rk and n points within given Rn.
For each point, computes the number self.k of points within a sphere of radius Rk and the number self.n within an inner sphere of radius Rk. It also provides a mask to take into account those points for which the statistics might be wrong, i.e. k == self.maxk, meaning that all available points were selected or when we are not sure to have all the point of the selected shell. If self.maxk is equal to the number of points of the dataset no mask will be applied, as all the point were surely taken into account. The mask is not necessary (i.e. is set to True) if condensed is True, as all points will be considered. Eventually add the weights of the points if they have an explicit multiplicity
- Parameters:
lk (int) – external shell radius
ln (int) – internal shell radius
Returns:
- model_validation_full(alpha=0.05, subset=None, pdf=False, cdf=True, path=None)[source]
Use Kolmogorov-Smirnoff test to assess the goodness of the id estimate
In order to validate estimate of the intrinsic dimension and the model and the goodness of the binomial estimator we perform a KS test. In particular, once the ID has been computed, we generate a new set n_i starting from the k_i, lk_, ln_i and d using the binomial distribution (ln and lk can be scalars if the id estimate has been performed with fixed radii). We then compare the CDF obtained from both the n_empirical and the new n_i using the KS test, looking at the maximum distance between the two CDF
- Parameters:
alpha (float, default=0.05) – tolerance for the KS test
subset (int or np.ndarray(int)) – subset of points used to perform the model validation
pdf (bool, default=False) – plot histogram of n_emp and n_i
cdf (bool, default=False) – plot cdf of n_emp and n_i
path (str, default=None) – directory where to save plots
- Returns:
s (float) – KS statistics, max distance between empirical and theoretical cdfs
pv (float) – p-value associated to the KS statistics
- return_id_fit(scales, subset=None, plot=True)[source]
Find the ID as least square fit between the actual and the theoretical number of points
Compute the id as least square fit between the average number of neighbours N(r) and the theoretical number of points within a given shell
- ho*V(r). Given a radius r, the fit takes into account all points with r_i <= r.
The fit is performed both between N and
ho*V and their logarithms.
- Args:
scales (list(int) or np.ndarray(int)): radii at which the number of neighbours is computed subset (np.ndarray(int)): indices of points to be used for the estimate plot (bool): plot the fit taking into account all given radii in R and ID(r)
- Returns:
ids (np.ndarray(float)): the intrinsic dimension estimates as a function of R with “normal” fit ids (np.ndarray(float)): the intrinsic dimension estimates as a function of R with “log” fit scales (np.ndarray(int)): the scales at which the ID is computed (the first point is excluded)
- return_id_fit_continuum(R, fit_intercept=True, subset=None, plot=True)[source]
Find the ID as linear fit between Log(n) vs Log(r), assuming N~r**d
Compute the id fitting a line between Log(n) and Log(r). Given a radius r_i, the fit takes into account all points with r <= r_i. Since no prescription has been given for this method on how to deal with points lying at discrete distances, we suppose to apply a small smearing, so that half of the points will end up a bit further than their initial distance and half will be slightly closer. As a consequence, when looking at distance r, we will be considering all points up to r-1 and half of those at distance r
- Parameters:
R (list(int) or np.ndarray(int)) – radii at which the number of neighbours is computed
fit_intercept (bool, default=True) – decide whether to fit the intercept or fix it to N(r=1)
subset (np.ndarray(int)) – indices of points to be used for the estimate
plot (bool) – plot the fit taking into account all given radii in R and ID(r)
- Returns:
ids (np.ndarray(float)) – the intrinsic dimension estimates as a function of R
R (np.ndarray(int)) – the scales at which the ID is computed (the first point is excluded)
- return_id_scaling(Lks, r=0.5, method='mle', subset=None, plot=True)[source]
Compute the ID by varying the radii
- Parameters:
Lks (list or np.ndarray(int)) – external radii lk
r (float, default = 0.5) – ratio among ln and lk
method (string, default='mle') – method to compute the id
subset (np.ndarray(int) or int, default=None) – indices of points to be used for the estimate
plot (bool, default=True) – whether to plot the id scaling
- Returns:
ids (np.ndarray) – intrinsic dimension at different scales
ids_err (np.ndarray) – error estimate on intrinsic dimension at different scales
Quick Start:
import numpy as np import matplotlib.pyplot as plt rng = np.random.default_rng(12345) from dadapy.id_discrete import IdDiscrete #uniformly sampled points on 5d lattice N = 2500 box = 50 d = 5 data = rng.integers(0,box,size=(N, d)) I3D = IdDiscrete(data, condensed=True, maxk=data.shape[0]) I3D.compute_distances(metric='manhattan',d_max=box*d,period=box) scales = np.arange(15,30) ids_scaling, ids_scaling_err = I3D.return_id_scaling(scales,r=0.75) ids_scaling: array([4.91, 4.86, 4.86, 4.94, 4.98, 5. , 5. , 4.99, 5.01, 5. , 5.01, 5.01, 5.02, 5.01, 5.01]) ids_scaling_err: array([0.09, 0.08, 0.07, 0.06, 0.05, 0.05, 0.04, 0.04, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02])
References
Iuri Macocco, Aldo Glielmo, Jacopo Grilli, and Alessandro Laio, Intrinsic Dimension Estimation for Discrete Metrics, Phys. Rev. Lett. 130, 067401 – Published 8 February 2023
- return_id_scaling_k(Ks, r=0.5, subset=None, plot=True)[source]
Compute the ID by varying the number of neighbours considered
- Parameters:
Ks (list or np.ndarray(int)) – number of neighbours
r (float, default = 0.5) – ratio between ln and lk
subset (int) – choose a random subset of the dataset to make the Id estimate
plot (bool, default=True) – whether to plot the id as a function of Ks
- Returns:
ids (np.ndarray(float)) – intrinsic dimension at different scales
ids_err (np.ndarray(float)) – error estimate on intrinsic dimension at different scales
scale (np.ndarray(float)) – scale at which the id was computed (mean values of lk)
- set_id(d)[source]
Set id manually :param id: intrinsic dimension to assign to the class :type id: float
- set_lk_ln(lk, ln)[source]
Assign lk and ln to class :param lk: radius of external shells :type lk: int :param ln: radius of internal shells :type ln: int