The id_discrete module

The id_discrete module contains the IdDiscrete class.

The different algorithms of intrinsic dimension estimation for discrete spaces

are implemented as methods of this class.

class id_discrete.IdDiscrete(coordinates=None, distances=None, is_network=False, maxk=None, condensed=None, weights=None, verbose=False, n_jobs=2)[source]

Estimates the intrinsic dimension of a dataset with discrete features using a binomial likelihood.

Inherits from class Base.

lk

radius of the external shells

Type:

int, np.ndarray(int)

ln

radius of the internal shells

Type:

int, np.ndarray(int)

k

total number of points within the external shell

Type:

np.ndarray(int)

n

total number of points within the internal shell

Type:

np.ndarray(int)

ratio

ratio between internal and external radii

Type:

float

intrinsic_dim

intrinsic dimension obtained with the binomial estimator

Type:

float

intrinsic_dim_err

error associated with the id estimation. Computed through Cramer-Rao or Bayesian inference

Type:

float

intrinsic_dim_scale

scale at which the id has been computed

Type:

float

posterior_domain

eventual support of the posterior distribution of the id

Type:

np.ndarray(float)

posterior

posterior distribution when evaluated with Bayesian inference

Type:

np.ndarray(float)

compute_distances(maxk=None, metric='manhattan', period=None, condensed=None, d_max=100)[source]

Compute distances between datapoints. Distances can be saved in an extended or compacted shape.

If condensed is True, self.distances will store how many points are present up to that distance. Conversely, the usual scheme is adopted (see same method in Base)

Parameters:
  • maxk – maximum number of neighbours for which distance is computed and stored

  • metric – type of metric

  • period (float or np.ndarray) – periodicity (only used for periodic distance computation). Default is None.

  • condensed (bool) – save how many points one finds at each distance instead of all distances

  • d_max (int, default=100) – decide the farthest distance to be saved (only if condensed is True)

compute_id_binomial_k_discrete(k, ratio=0.5, shell=False, subset=None, approx_err=True, set_attr=True)[source]

Calculate Id using the binomial estimator by fixing the number of neighbours or shells.

As in the case in which one fix lk, also in this version of the estimation one removes the central point from n and k. Two different ways of computing k,n,lk,ln are available, whether one intends k as the k-th neighbour or the k-th shell. In both cases, one wants to be sure that the shell chosen is

Parameters:
  • k (int) – order of neighbour that set the external shell

  • ratio (float) – ratio between internal and external shell

  • shell (bool) – k stands for number of neighbours or number of occupied shells

  • subset (int) – choose a random subset of the dataset to make the Id estimate

  • approx_err (bool, default=True) – if True, computes the error on the estimate using the CR. Otherwise it profiles the likelihood and compute its std (takes longer)

  • set_attr (bool, default=True) – assign id, id error and scale to the class

Returns:
  • intrinsic_dim (float) – the id estimation

  • intrinsic_dim_err (float) – the error estimate on the id

  • intrinsic_dim_scale (float) – the scale at which the id was computed: <lk>

compute_id_binomial_lk(lk=None, ln=None, method='bayes', subset=None, plot=True, set_attr=True)[source]

Calculate Id using the binomial estimator by fixing the eternal radius for all the points.

In the estimation of d one has to remove the central point from the counting of n and k as it generates the process, but it is not effectively part of it

Parameters:
  • lk (int) – radius of the external shell

  • ln (int) – radius of the internal shell

  • subset (int) – choose a random subset of the dataset to make the Id estimate

  • method (str, default 'bayes') – choose method between ‘bayes’ and ‘mle’. The bayesian estimate gives the distribution of the id, while mle only the max of the likelihood

  • plot (bool) – if bayes method is used, one can decide whether to plot the posterior

  • set_attr (bool) – changes class attributes after computation

Returns:
  • intrinsic_dim (float) – the id esteem

  • intrinsic_dim_err (float) – the error estimate on the id

  • intrinsic_dim_scale (float) – the scale at which the id was computed (lk)

compute_local_density(plot=True)[source]

Compute and compare the rescaled local density of points.

Compute and compare the local density of points inside inner and outer shells after fixing lk and ln as n/<n> and (k-n)/<k-n>. The closest the points to the y=x line, the more the local uniform density hypothesis is respected, the better will be the id estimate

Parameters:

plot (bool, default=True) – decide whether to plot the local density

Returns:
  • n (np.ndarray(float)) – n/<n>

  • m (np.ndarray(float)) – (k-n)/<k-n>

fix_k(k_eff=None, ratio=0.5)[source]

Compute Rk, Rn, n for each point given a selected value of k.

This routine computes external radii lk, internal radii ln and internal points n. It also ensures that lk is scaled onto the most external shell for which we are sure to have all points. A mask will exclude “badly behaved” points if condensed is False. NB As a consequence the k will be point dependent

Parameters:
  • k_eff (int, default=self.k_max) – selected (max) number of NN

  • ratio (float, default = 0.5) – approximate ratio among ln and lk

fix_k_shell(k_shell, ratio=0.5)[source]

Compute the lk, ln, n given k_shell.

This routine computes the external radius lk, the associated points k, internal radius ln and associated points n. The computation is performed starting from a given number of shells. It ensures that Rk is scaled onto the most external shell for which we are sure to have all points. NB in this case we will have an effective, point dependent k

Parameters:
  • k_shell (int) – selected (max) number of considered shells

  • ratio (float) – ratio among Rn and Rk

fix_lk(lk=None, ln=None)[source]

Compute the k points within the given Rk and n points within given Rn.

For each point, computes the number self.k of points within a sphere of radius Rk and the number self.n within an inner sphere of radius Rk. It also provides a mask to take into account those points for which the statistics might be wrong, i.e. k == self.maxk, meaning that all available points were selected or when we are not sure to have all the point of the selected shell. If self.maxk is equal to the number of points of the dataset no mask will be applied, as all the point were surely taken into account. The mask is not necessary (i.e. is set to True) if condensed is True, as all points will be considered. Eventually add the weights of the points if they have an explicit multiplicity

Parameters:
  • lk (int) – external shell radius

  • ln (int) – internal shell radius

model_validation_full(alpha=0.05, subset=None, artificial_samples=100000, pdf=False, cdf=True, filename=None)[source]

Use Kolmogorov-Smirnoff test to assess the goodness of the id estimate.

In order to validate estimate of the intrinsic dimension and the model and the goodness of the binomial estimator we perform a KS test. In particular, once the ID has been computed, we generate a new set n_i starting from the k_i, lk_, ln_i and d using the binomial distribution (ln and lk can be scalars if the id estimate has been performed with fixed radii). We then compare the CDF obtained from both the n_empirical and the new n_i using the KS test, looking at the maximum distance between the two CDF

Parameters:
  • alpha (float, default=0.05) – tolerance for the KS test

  • subset (int or np.ndarray(int)) – subset of points used to perform the model validation

  • artificial_samples (int, default=100000) – number of points sampled from the theoretical distribution

  • pdf (bool, default=False) – plot histogram of n_emp and n_i

  • cdf (bool, default=False) – plot cdf of n_emp and n_i

  • filename (str, default=None) – directory where to save plots

Returns:
  • s (float) – KS statistics, max distance between empirical and theoretical cdfs

  • pv (float) – p-value associated to the KS statistics

return_id_fit(scales, subset=None, plot=True)[source]

Find the ID as least square fit between the actual and the theoretical number of points.

Compute the id as least square fit between the average number of neighbours N(r) and the theoretical number of points within a given shell rho*V(r). Given a radius r, the fit takes into account all points with r_i <= r. The fit is performed both between N and rho*V and their logarithms.

Parameters:
  • scales (list(int) or np.ndarray(int)) – radii at which the number of neighbours is computed

  • subset (np.ndarray(int)) – indices of points to be used for the estimate

  • plot (bool) – plot the fit taking into account all given radii in R and ID(r)

Returns:
  • ids (np.ndarray(float)) – the intrinsic dimension estimates as a function of R with “normal” fit

  • ids (np.ndarray(float)) – the intrinsic dimension estimates as a function of R with “log” fit

  • scales (np.ndarray(int)) – the scales at which the ID is computed (the first point is excluded)

return_id_fit_continuum(R, fit_intercept=True, subset=None, plot=True)[source]

Find the ID as linear fit between Log(n) vs Log(r), assuming N~r**d.

Compute the id fitting a line between Log(n) and Log(r). Given a radius r_i, the fit takes into account all points with r <= r_i. Since no prescription has been given for this method on how to deal with points lying at discrete distances, we suppose to apply a small smearing, so that half of the points will end up a bit further than their initial distance and half will be slightly closer. As a consequence, when looking at distance r, we will be considering all points up to r-1 and half of those at distance r

Parameters:
  • R (list(int) or np.ndarray(int)) – radii at which the number of neighbours is computed

  • fit_intercept (bool, default=True) – decide whether to fit the intercept or fix it to N(r=1)

  • subset (np.ndarray(int)) – indices of points to be used for the estimate

  • plot (bool) – plot the fit taking into account all given radii in R and ID(r)

Returns:
  • ids (np.ndarray(float)) – the intrinsic dimension estimates as a function of R

  • R (np.ndarray(int)) – the scales at which the ID is computed (the first point is excluded)

return_id_scaling(Lks, r=0.5, method='mle', subset=None, plot=True)[source]

Compute the ID by varying the radii.

Parameters:
  • Lks (list or np.ndarray(int)) – external radii lk

  • r (float, default = 0.5) – ratio among ln and lk

  • method (string, default='mle') – method to compute the id

  • subset (np.ndarray(int) or int, default=None) – indices of points to be used for the estimate

  • plot (bool, default=True) – whether to plot the id scaling

Returns:
  • ids (np.ndarray) – intrinsic dimension at different scales

  • ids_err (np.ndarray) – error estimate on intrinsic dimension at different scales

Quick Start:

import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(12345)
from dadapy.id_discrete import IdDiscrete

#uniformly sampled points on 5d lattice

N = 2500
box = 50
d = 5
data = rng.integers(0,box,size=(N, d))

I3D = IdDiscrete(data, condensed=True, maxk=data.shape[0])
I3D.compute_distances(metric='manhattan',d_max=box*d,period=box)

scales = np.arange(15,30)

ids_scaling, ids_scaling_err = I3D.return_id_scaling(scales,r=0.75)

ids_scaling:
array([4.91, 4.86, 4.86, 4.94, 4.98, 5.  , 5.  , 4.99, 5.01, 5.  , 5.01, 5.01, 5.02, 5.01, 5.01])

ids_scaling_err:
array([0.09, 0.08, 0.07, 0.06, 0.05, 0.05, 0.04, 0.04, 0.03, 0.03, 0.03, 0.02, 0.02, 0.02, 0.02])

References

Iuri Macocco, Aldo Glielmo, Jacopo Grilli, and Alessandro Laio, Intrinsic Dimension Estimation for Discrete Metrics, Phys. Rev. Lett. 130, 067401 – Published 8 February 2023

return_id_scaling_k(Ks, r=0.5, subset=None, plot=True)[source]

Compute the ID by varying the number of neighbours considered.

Parameters:
  • Ks (list or np.ndarray(int)) – number of neighbours

  • r (float, default = 0.5) – ratio between ln and lk

  • subset (int) – choose a random subset of the dataset to make the Id estimate

  • plot (bool, default=True) – whether to plot the id as a function of Ks

Returns:
  • ids (np.ndarray(float)) – intrinsic dimension at different scales

  • ids_err (np.ndarray(float)) – error estimate on intrinsic dimension at different scales

  • scale (np.ndarray(float)) – scale at which the id was computed (mean values of lk)

set_id(d)[source]

Set id manually.

Parameters:

d (float) – intrinsic dimension to assign to the class

set_lk_ln(lk, ln)[source]

Assign lk and ln to class.

Parameters:
  • lk (int) – radius of external shells

  • ln (int) – radius of internal shells

set_ratio(r)[source]

Set ratio between internal and external shells.

Parameters:

r (float) – ratio, 0<r<1 is required

set_w(w)[source]

Set weights of points.

Parameters:

w (np.ndarray(int)) – multiplicity of points. Needs to have the same length of points