Locality sensitive hashing is a great tool for this problem.
Pick n random 400-dimensional vectors. (Be careful how you generate them, or not all directions will be equally likely; drawing each component from a standard Gaussian gives a uniformly random direction.) Each one defines a hyperplane through the origin that cuts your space in half. The sign of the dot product between one of these vectors and a new vector tells you which side of its hyperplane the new vector falls on. Computing n dot products therefore gives n 0/1 bits, which together make an n-bit hash.
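For concreteness, here's a minimal sketch of that hashing step in Python. The function names and NumPy usage are my own illustration, not any particular library's API:

```python
import numpy as np

def make_hyperplanes(n_bits, dim, seed=0):
    # One random hyperplane per hash bit; drawing each component from a
    # standard Gaussian makes every direction equally likely.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def hash_vector(planes, v):
    # Which side of each hyperplane v falls on, packed into an integer hash.
    bits = (planes @ v) > 0
    h = 0
    for b in bits:
        h = (h << 1) | int(b)
    return h

planes = make_hyperplanes(n_bits=16, dim=400)
# h = hash_vector(planes, some_400_dimensional_vector)
```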
Any new vector hashing to the same value must lie in the same narrow sliver of space extending from the origin, and those are exactly the vectors with high cosine similarity to each other, since their mutual angles are small. Likewise, anything hashing to almost the same value -- differing in only a few bits -- is likely to be nearby. So you can restrict your search for the most-similar vectors to the contents of one or a few buckets of hashed candidates.
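Continuing the sketch above, bucketing and candidate lookup could look like this; the Hamming radius of 1 is an arbitrary choice for illustration:

```python
from collections import defaultdict
from itertools import combinations

def build_index(planes, vectors):
    # Group vector ids by their n-bit hash.
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[hash_vector(planes, v)].append(i)
    return buckets

def nearby_hashes(h, n_bits, radius):
    # h itself, plus every hash differing from it in at most `radius` bits.
    yield h
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = h
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def candidates(planes, buckets, query, radius=1):
    # Ids worth scoring exactly for this query.
    h = hash_vector(planes, query)
    found = []
    for hh in nearby_hashes(h, planes.shape[0], radius):
        found.extend(buckets.get(hh, []))
    return found
```

You then compute exact cosine similarity only over the returned candidates rather than the whole data set.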
It doesn't directly help with memory, since any particular bucket might be needed to satisfy a request. You also lose some accuracy, since there's no guarantee the most similar vectors lie in the buckets you examine (though it becomes more likely the more buckets you check). Mostly, it lets you trade accuracy for speed. However, you may find you can get away with a caching scheme, where buckets that are rarely if ever accessed don't stay in memory.
You can see an implementation of this in Oryx, which I think is pretty straightforward.
Most of the complexity there comes from letting you specify a target percentage of vectors to evaluate; it works out a suitable hash size from that target and the number of cores on your machine.
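Roughly speaking, more hash bits means more, smaller buckets, and so a smaller fraction of the vectors examined per query. A back-of-the-envelope version of that sizing calculation might look like the following; it assumes vectors are spread evenly over the 2^n buckets and is not Oryx's actual formula:

```python
from math import comb

def choose_num_bits(target_fraction, hamming_radius=1, max_bits=32):
    # Smallest bit count whose probed buckets cover at most the target
    # fraction of vectors, assuming roughly uniform bucket sizes.
    for n in range(1, max_bits + 1):
        probed = sum(comb(n, r) for r in range(hamming_radius + 1))
        if probed / 2 ** n <= target_fraction:
            return n
    return max_bits

# choose_num_bits(0.01) -> 11: probing the exact bucket plus all 1-bit
# neighbors touches 12 of 2048 buckets, about 0.6% of the vectors.
```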