17

I'm reading The Elements of Statistical Learning. I have a question about the curse of dimensionality.

In section 2.5, p.22:

Consider $N$ data points uniformly distributed in a $p$-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression:

$$d(p,N) = \left(1-\frac{1}{2^{1/N}} \right)^{1/p}.$$

For $N=500$, $p=10$, $d(p,N)\approx0.52$, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point.

I accept the equation. My question is: how do we deduce this conclusion?
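
For concreteness, a quick numeric check of the quoted value (plain Python; nothing is assumed beyond the formula above):

```python
# Median distance from the origin to the nearest of N uniform points
# in the p-dimensional unit ball, per the quoted formula.
def d(p, N):
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

print(d(10, 500))  # ~0.5177, i.e. roughly 0.52
```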

chyojn
  • 503
  • When you cite a book, please quote both its author and the title. (Are there really no upper-case letters in the title?) – hmakholm left over Monica Dec 11 '11 at 04:59
  • Can be downloaded for free from http://www-stat.stanford.edu/~tibs/ElemStatLearn/download.html (the download took a while; it is 763 pages). Title and authors are at http://www.springer.com/statistics/statistical+theory+and+methods/book/978-0-387-84857-0?changeHeader – Will Jagy Dec 11 '11 at 05:07

3 Answers

21

This is exercise 2.3, which they mention there.

PDF means Probability Density Function. CDF means Cumulative Distribution Function. As we are considering continuous distributions, the former is the derivative of the latter.

The volume of a ball of radius $r$ in $\mathbb R^p$ is $\omega_p r^p,$ where $\omega_p$ is a constant depending only on $p,$ given by the shorthand $$ \omega_p = \frac{\pi^{p/2}}{(p/2)!},$$ where $(p/2)!$ is to be read as $\Gamma(p/2+1)$ when $p$ is odd.
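
(A quick check of this constant against the familiar low-dimensional values; Python sketch, with the function name `omega` being my own label:)

```python
from math import pi, gamma

# Unit-ball volume in R^p: omega_p = pi^(p/2) / Gamma(p/2 + 1).
# The Gamma function extends the factorial, so odd p is covered too.
def omega(p):
    return pi ** (p / 2) / gamma(p / 2 + 1)

print(omega(2))   # pi      ~ 3.1416, area of the unit disk
print(omega(3))   # 4*pi/3  ~ 4.1888, volume of the unit ball in R^3
print(omega(10))  # ~ 2.5502, already shrinking as p grows
```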

As a result, the probability that a point, taken uniformly in the unit ball, lies within distance $x$ of the origin is the volume of the ball of radius $x$ divided by the volume of the unit ball. The factors of $\omega_p$ cancel, so we get the CDF $$ F(x) = x^p, \; \; \; 0 \leq x \leq 1.$$ The corresponding PDF is the derivative, $$ f(x) = px^{p-1}, \; \; \; 0 \leq x \leq 1.$$
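
(A Monte Carlo sanity check of $F(x) = x^p$, sketched for $p = 3$, where rejection sampling from the enclosing cube is still efficient; the variable names are mine:)

```python
import random

# Rejection-sample points uniformly from the unit ball in R^3 (about
# 52% of proposals from the cube [-1,1]^3 are accepted), then compare
# the empirical CDF of the distance to the origin with x^p.
p, trials = 3, 100_000
radii = []
while len(radii) < trials:
    u, v, w = (random.uniform(-1, 1) for _ in range(3))
    r2 = u * u + v * v + w * w
    if r2 <= 1.0:
        radii.append(r2 ** 0.5)

for x in (0.25, 0.5, 0.75):
    empirical = sum(r <= x for r in radii) / trials
    print(x, empirical, x ** p)  # empirical vs. theoretical CDF
```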

From page 150, section 4.6 of Introduction to Mathematical Statistics by Hogg and Craig, we are told that the marginal (individual) PDF for $y_1,$ the smallest order statistic (the minimum) of $n$ points with CDF $F$ and PDF $f$ is $$ g(y) = n \left( 1 - F(y) \right)^{n-1} f(y).$$

In our case that gives $$ g(y) = n \left( 1 - y^p \right)^{n-1} p y^{p-1},$$ which can be readily integrated to give the CDF $$ G(y) = 1 - \left( 1 - y^p \right)^n.$$
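
(One can check $G$ by simulation: since $F(x) = x^p$ has inverse $F^{-1}(u) = u^{1/p}$, inverse-transform sampling produces distances as $U^{1/p}$ with $U$ uniform on $[0,1]$. A sketch for $n = 500$, $p = 10$:)

```python
import random

# Empirical check of G(y) = 1 - (1 - y^p)^n for the minimum of n
# draws from F(x) = x^p, each draw sampled as U^(1/p).
n, p, trials = 500, 10, 2_000
minima = [min(random.random() ** (1 / p) for _ in range(n))
          for _ in range(trials)]

for y in (0.4, 0.52, 0.6):
    empirical = sum(m <= y for m in minima) / trials
    theory = 1 - (1 - y ** p) ** n
    print(y, empirical, theory)
```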

The mean, or expected value, of $y$ is a messy integral. The median, on the other hand, is simply the value of the random variable $y$ such that $G(y)=1/2$ (in the case of a continuous variable). If you ran this experiment many times, the probability of getting a minimum smaller than the median would be 50 percent, and the probability of getting a minimum larger than the median would also be 50 percent. For the traditional bell curve, the median and the mean are likely to be pretty close. In this case, a polynomial density restricted to a short interval, I am not sure the median and mean are necessarily close to each other.
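
(For what it's worth, a quick numeric comparison suggests they are close in this instance. Since $y$ lives on $[0,1]$, the mean of the minimum can be written $E[y] = \int_0^1 \left(1 - G(y)\right) dy = \int_0^1 (1-y^p)^n \, dy$; a sketch, with the integral done by a crude midpoint rule:)

```python
# Median vs. mean of the minimum distance for n = 500, p = 10.
n, p = 500, 10

# Median: solve G(y) = 1/2 in closed form.
median = (1 - 0.5 ** (1 / n)) ** (1 / p)

# Mean: integral of (1 - y^p)^n over [0,1], midpoint rule.
steps = 100_000
mean = sum((1 - ((k + 0.5) / steps) ** p) ** n
           for k in range(steps)) / steps

print(median)  # ~0.518
print(mean)    # ~0.511, so quite close to the median here
```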

I do not see how you can read this book without a full semester calculus-based course in mathematical statistics.

Solve $G(y) = 1/2$ and you get their expression.


Will Jagy
  • 139,541
  • I found it is very hard to read this book. Is there any suggestion about the prerequisite course or textbook? – chyojn Dec 16 '11 at 13:43
  • @chyojn their online book says "We expect that the reader will have had at least one elementary course in statistics, covering basic topics including linear regression." There are surely easier books than Hogg and Craig that do that much. But I am not up to date on textbooks. There is even an easier way to explain this problem, but not entirely without calculus. – Will Jagy Dec 16 '11 at 20:25
  • Nice solution but isn't the cdf just the indefinite integral of the pdf from 0 to y in which case the 1 - at the beginning of your equation for G(y) should not be there? – Ben Oct 28 '14 at 19:32
2

If you accept the equation (do you?), then the median distance from the origin to the closest data point is more than $0.5$, so that closest point is typically nearer the boundary than the origin. By the definition of uniformity, the density of points doesn't vary over the ball. Any other data point has even less chance of having a nearby neighbor, because the part of a neighborhood around it that falls outside the ball cannot contain any points. So the median distance from any given data point to its nearest neighbor is at least $0.52$ in this case.
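
(A Monte Carlo sketch supporting this, with the same $N$ and $p$; the radii are drawn as $U^{1/p}$, per the CDF $x^p$ derived in the answer above, and all names here are my own:)

```python
import random, math

# Sample N points uniformly in the p-ball: a Gaussian direction,
# normalized, scaled by a radius drawn as U^(1/p).
def ball_point(p):
    g = [random.gauss(0, 1) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    r = random.random() ** (1 / p)
    return [r * v / norm for v in g]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

N, p = 500, 10
pts = [ball_point(p) for _ in range(N)]

# For each point, the distance to its nearest neighbor (O(N^2), fine
# at this size); the median comes out around 0.52 or a bit above.
nn = sorted(min(dist(a, b) for j, b in enumerate(pts) if j != i)
            for i, a in enumerate(pts))
print(nn[N // 2])
```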

Ross Millikan
  • 374,822
1

The instinct might be to use order statistics here, and indeed that works, but I think there is a simpler way, using the fact that we're dealing with the closest (minimum-distance) point.

For a distance $d$ from the origin in $p$ dimensions with $N$ data points, the probability that one data point (call its distance from the origin the random variable $X$) lies within $d$ of the origin is the ratio of the volumes in $p$ dimensions: $\frac{kd^p}{k\cdot 1^p} = d^p$, for some complicated constant $k$ from the formula for the volume of a high-dimensional ball, which we can ignore because it cancels out. Hence the probability that one uniformly drawn point lies more than $d$ from the origin is $P(X > d) = 1-P(X\le d) = 1-d^p$.

Now, the closest point lying more than $d$ from the origin is the same event as all $N$ (independent) points lying more than $d$ from the origin, which happens with probability $(1-d^p)^N$. We are interested in the case where this probability is $\frac{1}{2}$, so equating gives $(1-d^p)^N = \frac{1}{2} \implies d = \left(1-\left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}$, as required.

This avoids wrangling PDFs or CDFs, making full use of the fact that we're dealing with the first order statistic of a uniform distribution. If we were asked about another order statistic or another distribution, however, we would have to use calculus on the PDFs and CDFs.
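
(As a cross-check of the algebra, one can also solve $(1-d^p)^N = \frac{1}{2}$ numerically, say by bisection, and compare with the closed form; a sketch:)

```python
# Solve (1 - d^p)^N = 1/2 for d by bisection and compare with the
# closed form d = (1 - (1/2)^(1/N))^(1/p).
N, p = 500, 10

def survival(d):  # P(all N points lie farther than d from the origin)
    return (1 - d ** p) ** N

lo, hi = 0.0, 1.0
for _ in range(60):  # survival is decreasing in d on [0, 1]
    mid = (lo + hi) / 2
    if survival(mid) > 0.5:
        lo = mid  # root lies to the right
    else:
        hi = mid

closed_form = (1 - 0.5 ** (1 / N)) ** (1 / p)
print(lo, closed_form)  # both ~0.5177
```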