First of all, let me state that I'm an engineer, not a mathematician, and I finished my studies 25 years ago, so I'm quite rusty. Please forgive me if I'm asking something obvious or stupid. My problem is this:

I have over 100 thousand vectors in a high-dimensional space (over 1000 parameters; yes, it's from a deep learning problem), and I want to get the n vectors closest to a given vector using cosine distance. With the naive approach, I calculate the distance to every vector one by one, but that takes a lot of compute power.
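For reference, the naive approach could be sketched as below. This is only an illustration, not code from my project: the names `cosine_distance` and `nearest_by_cosine` and the tiny 3-D sample data are made up, and it assumes no vector is all zeros.

```python
import heapq
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_by_cosine(query, vectors, n):
    """Brute force: score every vector, then keep the n smallest distances."""
    scored = ((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(n, scored)  # list of (distance, index) pairs

# Hypothetical sample data, 3-D only to keep the example readable.
vectors = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
nearest = nearest_by_cosine((1.0, 0.0, 0.0), vectors, 2)
```

The cost is one full dot product per stored vector for every query, which is the part I would like to avoid.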

Now the part where I might be asking a stupid question: if these vectors were in 3-D space, I would first fix the maximum distance I want, say a radius of x, and look for the vectors inside the sphere of radius x around the given vector. As a first step, I would filter out all vectors that differ from the given vector by more than x in any single dimension. For example, if the given vector sits at (0, 0, 0) and I'm looking for all vectors within a radius of 2, I would filter out every vector with a coordinate greater than 2 in absolute value, e.g. (0, 0, 3), (0, 3, 0) or (3, 0, 0), because their distance to (0, 0, 0) is obviously greater than 2. Then I would calculate the cosine or Euclidean distance for the remaining vectors.
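A minimal sketch of that two-stage idea, using Euclidean distance for the exact check. The function name `within_radius` and the sample points are hypothetical and only there to make the steps concrete.

```python
import math

def within_radius(query, vectors, radius):
    """Return the indices of vectors within Euclidean `radius` of `query`."""
    hits = []
    for i, v in enumerate(vectors):
        # Cheap box test: a gap larger than `radius` in any single coordinate
        # already implies the Euclidean distance is larger than `radius`.
        if any(abs(a - b) > radius for a, b in zip(query, v)):
            continue
        # Exact test, only for vectors that survived the box test.
        if math.dist(query, v) <= radius:
            hits.append(i)
    return hits

# Hypothetical sample points around the origin.
points = [(0, 0, 3), (1, 1, 1), (2, 2, 2), (0, 2, 0)]
hits = within_radius((0, 0, 0), points, 2)
```

Here (0, 0, 3) is rejected by the box test alone, while (2, 2, 2) passes the box test and is only rejected by the exact distance, so the box test is a necessary but not a sufficient condition.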

I suspect this approach is not applicable in higher dimensions, so is there any approach that lets me filter out the distant vectors quickly before calculating the distances?

  • Can you clarify what you mean by "cosine distance", for example in 3D? Is it that $\vec{u}$ is close to $\vec{v} \iff \cos(\vec{u},\vec{v})$ is close to $1$? – Jean Marie Oct 30 '20 at 06:32
  • @JeanMarie I mean that after filtering out the distant vectors, I would calculate either the Euclidean distance or the angle. Actually, my problem is not how to calculate the distance but how to quickly filter out the most distant vectors without calculating the distance. – Ahmet Cetin Oct 30 '20 at 06:38
  • Maybe by projecting all your data onto the unit hypersphere, then using an adapted quadtree data structure such as the one described here https://arxiv.org/abs/cs/0507049 ? – Jean Marie Oct 30 '20 at 06:55
  • Your 3D filtering procedure filters out points outside a cubical box of half-width 2 (side length 4), centred on the origin. That also works in higher dimensions. But the problem is that the ratio of the hypervolume of an n-ball to that of its enclosing n-box becomes very small as n gets large. So there will still be a large proportion of unwanted points inside the n-box that are outside the n-ball. E.g., for $n=1000$ the ball/box ratio is $\approx 2.8743 \times 10^{-1187}$. Also see https://math.stackexchange.com/a/2644914/207316 & https://math.stackexchange.com/a/258558/207316 – PM 2Ring Oct 30 '20 at 07:12
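
The ball/box ratio quoted in the last comment can be checked with the standard n-ball volume formula $V_n(r) = \pi^{n/2} r^n / \Gamma(n/2 + 1)$. The sketch below is only an illustration: the function name is made up, and it works in logarithms because the ratio itself underflows a float for large n.

```python
import math

def log10_ball_to_box_ratio(n):
    """log10 of (volume of the unit n-ball) / (volume of its enclosing box of side 2)."""
    # ln V_n(1) = (n/2) ln(pi) - ln Gamma(n/2 + 1)
    log_ball = (n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1)
    # The enclosing box has side 2, so its volume is 2**n.
    log_box = n * math.log(2)
    return (log_ball - log_box) / math.log(10)

ratio_3d = 10 ** log10_ball_to_box_ratio(3)      # pi/6, about 0.52
log_ratio_1000 = log10_ball_to_box_ratio(1000)   # about -1186.5
```

The ratio does not depend on the radius, since both volumes scale with $r^n$. In 3D roughly half of the box lies inside the ball, whereas for $n = 1000$ the result agrees with the order of magnitude given in the comment.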
