I started researching nearest neighbor search in IR a couple of weeks ago. I am still very new to this field, but what I have discovered so far from the literature is:
1) For the exact nearest neighbor search problem, efficient algorithms exist only if the dimension of the dataset is small, say $<10$. With increasing dimension, every tree-based algorithm, for example, seems to degrade to a linear search.
2) For datasets with a very high dimension (let's say $d > 10000$), the only feasible way of doing nearest neighbor search seems to be using mapping-based approximations such as LSH (locality-sensitive hashing).
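To make point 1) concrete, here is a small self-contained toy experiment (the function name and setup are my own, not from any paper) that illustrates the "concentration of distances" effect usually cited as the reason tree-based exact search degrades: as the dimension grows, the relative gap between the nearest and farthest point shrinks, so pruning rules stop being effective.

```python
# Toy demo of distance concentration in high dimensions.
# As dim grows, (d_max - d_min) / d_min for distances from the origin
# to uniform random points in [0, 1]^dim shrinks toward 0.
import math
import random

def distance_contrast(dim, n_points=1000, seed=0):
    """Relative spread (d_max - d_min) / d_min of distances from the
    origin to n_points uniform random points in the unit cube."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, distance_contrast(dim))
```

In low dimensions the contrast is large (some points are much closer than others); in high dimensions almost all points are nearly equidistant, which is exactly the regime where a k-d tree visits almost every leaf.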
Most of the papers I have found so far are more about the technical details of specific techniques to solve the problem. But, before diving deep into these details, I would like to understand the problem in a more abstract way.
So my $\textbf{question}$(s) are:
a) What theorems/theories describe why tree-based methods degrade to linear search as the dimension grows?
b) Are there theorems/theories that explain the general feasibility of exact nearest neighbor search as a function of the dimension of the data?
c) Are there theorems/theories that explain why we must give up exact nearest neighbor search in favor of approximate nearest neighbor search? Do they give hints about the relation between the quality of the approximation (e.g. the success probability of hashing) and the dimension of the data?
I am sorry to ask three questions at once, but since they are so strongly related to each other, and since I am not yet experienced enough to ask the (perhaps existing) unifying meta-question, I hope it is still OK.
edit:
To be more precise: I am interested in whether the literature contains theorems on the (quantitative) relations between:
1) The quality of an approximation of NN-search (e.g. the probability of success that an algorithm returns the true NN).
2) The (intrinsic/metric) dimension of the data.
3) The query time.
4) The space requirements.
From my limited knowledge of this field, I see the tradeoff between these properties, and I understand why high dimensions can make distances "not meaningful". But, apart from some theoretical bounds for specific techniques, I do not know of any general theory or general theorems about this fact yet. I would be very thankful if you (the community) could give me hints on where to invest further research and where I may find what I am searching for.
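As one example of the fully quantitative relation I am after, random-hyperplane LSH (SimHash, Charikar 2002) makes the "quality of approximation" explicit: for unit vectors $u, v$ with angle $\theta$, a single random hyperplane puts them on the same side with probability $1 - \theta/\pi$, independent of the ambient dimension. A minimal sketch of my own (function names are mine, not a library API) checking this empirically:

```python
# Empirical check of the SimHash collision probability 1 - theta/pi.
import math
import random

def collision_probability(u, v, n_planes=20000, seed=1):
    """Fraction of random hyperplanes through the origin that put
    u and v on the same side (i.e. sign(<r,u>) == sign(<r,v>))."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_planes):
        # A Gaussian vector gives a uniformly random hyperplane normal.
        r = [rng.gauss(0.0, 1.0) for _ in range(len(u))]
        du = sum(a * b for a, b in zip(r, u))
        dv = sum(a * b for a, b in zip(r, v))
        if (du >= 0) == (dv >= 0):
            same += 1
    return same / n_planes

# Two unit vectors at a 60-degree angle; theory predicts 1 - (1/3) = 2/3.
theta = math.pi / 3
u = [1.0, 0.0]
v = [math.cos(theta), math.sin(theta)]
print(collision_probability(u, v))
```

This is the kind of statement I mean by a "quantitative relation" between approximation quality and the geometry of the data; my question is whether such results exist in a more general, technique-independent form.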
There are general theorems about lower/upper bounds for LSH (e.g. Indyk/Motwani/Andoni) and about specific spatial algorithms (e.g. Edgar Chávez).
So I can make my question more specific: what are the known theorems about these relations?
– Jonas Köhler May 19 '15 at 14:53