@HarryUp
2017-04-07T02:20:00.000000Z
字数 8166
阅读 752
Paper_Note
intrinsic_dimension
The ID estimation methods depend on the scale of data and still suffer from curse of dimensionality (robustness).It is considered to provide a lower bound on the cardinality in order to guarantee an accurate ID estimation, however, it is available only for fractal-based methods and Little-Jung-Maggioni's algorithm.
The defination of intrinsic deminsion:
the categories of ID estimation:
Global methods use the whole data set making implicitly the assumption that data lie on a unique manifold of a fixed dimen-sionality.
Local methods are algorithms that provide an ID estimation using the information contained in sample neighborhoods.
Pointwise methods, in this category there are the algorithms that can produce both a global ID estimate of the whole data set and local pointwise ID estimate of each pattern of the data set.
Meanwhile, the ideal ID estimator is considered to be of the following characters:
ID method evaluaton:
The more specific description of Fractal-based methods (GP+CV):
Fractal-based methods are global methods that were originally proposed in physics to estimate the attractor dimension of nonlinear systems. Since fractals are generally characterized by a non-integer dimensionality (e.g., Koch's curve dimension is ), these methods are called fractal.
Kégl's algorithm
Since Hausdorff dimension is very hard to estimate, fractal methods usually replace it with an upper bound, the Box-Counting Dimension. Let be a data set, we denote by ν(r) the number of the boxes (i.e., hypercubes) of size r required to cover . It can be proved that , where is the dimension of the set. This motivates the following definition. The Box-Counting Dimension (or Kolmogorov capacity) of the set is defined by
Grassberger–Procaccia algorithm (GP)
A good alternative to the Box-Counting Dimension, among many proposed is the Correlation Dimension, defined as follows. If the correlation integral is given by:
Takens' method
Takens proposed a method, based on the Maximum Likelihood principle, that estimates the expectation value of Correlation Dimension. Let be the set formed by the Euclidean distances (denoted by ), between data points of , lower than the so-called cut-off radius . Using the Maximum Likelihood principle Takens proved that the expectation value of the Correlation Dimension is:
Work envelope of fractal-based methods
Differently from the other ID methods described before, fractal-based methods satisfy, in addition to the third one, the fourth Ideal ID requirement, i.e., they have a work envelope. They provide a lower bound that the cardinality of data set must fulfill in order to get an accurate ID estimate. Eckmann and Ruelle proved that to get an accurate estimate of the dimension M,the data set cardinality has to satisfy the following inequality:
To improve the reliability of the ID estimate when the cardinality does not fulfill the inequality, the method of surrogate data was proposed. The method of surrogate data is based on bootstrap. Given a dataset , the method consists in creating a new synthetic data set , with larger cardinality, that has the same mean, variance and Fourier Spectrum of . Although the cardinality of can be chosen arbitrarily, the method of surrogate data becomes infeasible when the dimensionality of the data set is high. For instance, a 50-dimensional data set to be estimated must have at least data points, on the basis of the inequality.
Camastra–Vinciarelli's algorithm (CV)
For the reasons described above, Fractal-based algorithms do not satisfy the third Ideal ID requirement, i.e., they do not provide reliable ID estimate when the cardinality of the data set is high. In order to cope with this problem, Camastra and Vinciarelli proposed an algorithm to power Grassberger and Procaccia method (GP method) w.r.t. high dimensionality, evaluating empirically how much the GP method underestimates the dimensionality of a data set when the data set cardinality is unadequate to estimate ID properly. Let be a data set, Camastra–Vinciarell's algorithm has the following steps:
The algorithm assumes that the curve depends on and its dependence on sets are negligible. It is worth mentioning that OganovandValle used Camastra–Vinciarelli's algorithm to estimate ID of high dimensional Crystal Fingerprint spaces.