I started researching nearest neighbor search in IR a couple of weeks ago. I am still very new to this field, but what I have discovered so far from the literature is:
1) For the exact nearest neighbor search problem, efficient algorithms exist only if the dimension of the dataset is small, say $<10$. With increasing dimension, every tree-based algorithm, for example, seems to degrade to a linear search.
2) For datasets with a very high dimension (let's say $d > 10000$), the only feasible way of doing nearest neighbor search seems to be using mapping-based approximations such as LSH (locality-sensitive hashing).
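To make point 1) concrete, here is a small self-contained toy experiment (the function name and setup are my own, not from any paper) that illustrates the "concentration of distances" effect usually cited as the reason tree-based exact search degrades: as the dimension grows, the relative gap between the nearest and farthest point shrinks, so pruning rules stop being effective.

```python
# Toy demo of distance concentration in high dimensions.
# As dim grows, (d_max - d_min) / d_min for distances from the origin
# to uniform random points in [0, 1]^dim shrinks toward 0.
import math
import random

def distance_contrast(dim, n_points=1000, seed=0):
    """Relative spread (d_max - d_min) / d_min of distances from the
    origin to n_points uniform random points in the unit cube."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, distance_contrast(dim))
```

In low dimensions the contrast is large (some points are much closer than others); in high dimensions almost all points are nearly equidistant, which is exactly the regime where a k-d tree visits almost every leaf.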
Most of the papers I have found so far are more about the technical details of specific techniques to solve the problem. But, before diving deep into these details, I would like to understand the problem in a more abstract way.
So my $\textbf{question}$(s) are:
a) What theorems/theories describe why tree-based methods degrade to linear search as the dimension grows?
b) Are there theorems/theories that explain the general feasibility of exact nearest neighbor search as a function of the dimension of the data?
c) Are there theorems/theories that explain why we must give up exact nearest neighbor search in favor of approximate nearest neighbor search? Do they give hints about the relation between the quality of the approximation (e.g. the success probability of hashing) and the dimension of the data?
I am sorry to ask three questions at once, but since they are so strongly related to each other, and since I am not yet experienced enough to ask the (perhaps existing) unifying meta-question, I hope it is still OK.
edit:
To be more precise: I am interested in whether the literature contains theorems on the (quantitative) relations between:
1) The quality of an approximation of NN-search (e.g. the probability of success that an algorithm returns the true NN).
2) The (intrinsic/metric) dimension of the data.
3) The query time.
4) The space requirements.
From my limited knowledge of this field, I see the tradeoff between these properties, and I understand why high dimensions can make distances "not meaningful". But, apart from some theoretical bounds for specific techniques, I do not know of any general theory or general theorems about this fact yet. I would be very thankful if you (the community) could give me hints on where to invest further research and where I may find what I am searching for.
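As one example of the fully quantitative relation I am after, random-hyperplane LSH (SimHash, Charikar 2002) makes the "quality of approximation" explicit: for unit vectors $u, v$ with angle $\theta$, a single random hyperplane puts them on the same side with probability $1 - \theta/\pi$, independent of the ambient dimension. A minimal sketch of my own (function names are mine, not a library API) checking this empirically:

```python
# Empirical check of the SimHash collision probability 1 - theta/pi.
import math
import random

def collision_probability(u, v, n_planes=20000, seed=1):
    """Fraction of random hyperplanes through the origin that put
    u and v on the same side (i.e. sign(<r,u>) == sign(<r,v>))."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_planes):
        # A Gaussian vector gives a uniformly random hyperplane normal.
        r = [rng.gauss(0.0, 1.0) for _ in range(len(u))]
        du = sum(a * b for a, b in zip(r, u))
        dv = sum(a * b for a, b in zip(r, v))
        if (du >= 0) == (dv >= 0):
            same += 1
    return same / n_planes

# Two unit vectors at a 60-degree angle; theory predicts 1 - (1/3) = 2/3.
theta = math.pi / 3
u = [1.0, 0.0]
v = [math.cos(theta), math.sin(theta)]
print(collision_probability(u, v))
```

This is the kind of statement I mean by a "quantitative relation" between approximation quality and the geometry of the data; my question is whether such results exist in a more general, technique-independent form.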
There are general theorems about lower/upper bounds for LSH (e.g. Indyk/Motwani/Andoni) and about specific spatial algorithms (e.g. Edgar Chávez).
So I can make my question more specific: what are the known theorems about these relations?
– Jonas Köhler May 19 '15 at 14:53