I am going through the Manning book on information retrieval. Currently I am at the part about cosine similarity, and one thing is not clear to me.

Let's say that I have the tf-idf vectors for the query and a document, and I want to compute the cosine similarity between the two vectors.

When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector, or just the terms that also appear in the query?

Here is an example: we have the user query "cat food beef".
Let's say its vector is (0, 1, 0, 1, 1) (assume there are only 5 dimensions in the vector, one for each unique word across the query and the document).
We have a document "Beef is delicious".
Its vector is (1, 1, 1, 0, 0). We want to find the cosine similarity between the query and document vectors.

AutisticRat

1 Answer


You want to use all of the terms in the vector. The components that are zero contribute nothing to the sums, so including them does not change the result.

In your example, where your query vector $\mathbf{q} = [0,1,0,1,1]$ and your document vector $\mathbf{d} = [1,1,1,0,0]$, the cosine similarity is computed as

similarity $= \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}||_2 \, ||\mathbf{d}||_2} = \frac{0\times1+1\times1+0\times1+1\times0+1\times0}{\sqrt{0^2+1^2+0^2+1^2+1^2} \times \sqrt{1^2+1^2+1^2+0^2+0^2}} = \frac{0+1+0+0+0}{\sqrt{3}\sqrt{3}} = \frac{1}{3}$
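
For completeness, here is a minimal Python sketch of the same computation (plain Python, no external libraries; the vectors `q` and `d` are just the ones from your question). Note that the zero components are summed along with everything else, they simply contribute nothing:

```python
import math

# Query "cat food beef" and document "Beef is delicious", in the
# 5-dimensional term space from the question (one axis per unique word).
q = [0, 1, 0, 1, 1]
d = [1, 1, 1, 0, 0]

def cosine_similarity(a, b):
    """Cosine similarity over the full vectors: every component,
    zero or not, enters both the dot product and the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(q, d))  # 0.333... = 1/3
```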

timleathart