Curse of dimensionality

There are three problems in high dimension

  • First, More length is needed to capture same rate of data.

  • Second, All sample points are close to an edge of the sample.

  • Third, We need much more sample in high dimension to capture same percentage of data in low dimension.

✏️ First Problem

Let's think about a unit hypercube in p dimension, {x  xi<1,xRp}\{x|\; x_i<1, x \in \mathbb{R}^p\}. We assume the density of x is uniformly distributed.

The average length of one side in p dimension is that: ep(r)=r1/p e_p(r)=r^{1/p}

There is a subcube in hypercube occupying r% of all data.

  • InR1\mathbb{R}^1, one side length of a subcube is 1/41/4 for 1/41/4 data.

  • InR2\mathbb{R}^2, one side length of a subcube is 1/21/2for 1/41/4data.

  • InR3\mathbb{R}^3, one side length of a subcube is 1/21/\sqrt{2} for 1/41/4data.

When dimension becomes higher, an amount of information becomes increased. More information means longer length of one side for same rate of data.

e10(0.01)=0.63e_{10}(0.01)=0.63

In R10\mathbb{R}^{10}, we need the length 0.63 to capture 1% of data.

✏️ Second Problem

All sample becomes close into an edge.

ωp=πp/2(p/2)!F(x)=xp,  0x1.f(x)=pxp1,  0x1.g(y)=n(1yp)n1pyp1G(y)=1(1yp)n\omega_p=\frac{\pi^{p/2}}{(p/2)!} \\ F(x)=x^p, \; 0\leq x \leq 1. \\ f(x)=px^{p-1}, \; 0\leq x \leq 1. \\ g(y)=n(1-y^p)^{n-1}py^{p-1} \\ G(y)=1-(1-y^p)^n
d(p,N)=(1121/N)1/pd(p,N)= (1-\frac{1}{2}^{1/N})^{1/p}

Find the y value satisfying G(y)=1/2G(y)=1/2. The distance the closest point from origin.

✏️ Third Problem

The distance becomes very short when the number of sample is large and the dimensionality is high. For our fixed sample, the sampling density is proportional to N1/pN^{1/p}

  • To maintain the same density, we need one hundred samples in one dimension. However, we need 10010100^{10}samples in 100 dimension to maintain the same density.

In this context, sampling density means the number of sample in unit length.

Last updated