I am going through the Manning book on information retrieval. Currently I am at the part about cosine similarity, and one thing is not clear to me.

Let's say that I have the tf-idf vectors for the query and a document, and I want to compute the cosine similarity between the two vectors.

When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector, or just the terms that also appear in the query?

Here is an example: we have the user query "cat food beef".
Let's say its vector is (0, 1, 0, 1, 1) (assume there are only 5 dimensions in the vector, one for each unique word across the query and the document).
We have a document "Beef is delicious".
Its vector is (1, 1, 1, 0, 0). We want to find the cosine similarity between the query and document vectors.

AutisticRat

1 Answer


You want to use all of the terms in the vector. The components that are zero contribute nothing to the sums, so including them does not change the result.

In your example, where your query vector $\mathbf{q} = [0,1,0,1,1]$ and your document vector $\mathbf{d} = [1,1,1,0,0]$, the cosine similarity is computed as

similarity $= \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}||_2 \, ||\mathbf{d}||_2} = \frac{0\times1+1\times1+0\times1+1\times0+1\times0}{\sqrt{0^2+1^2+0^2+1^2+1^2} \times \sqrt{1^2+1^2+1^2+0^2+0^2}} = \frac{0+1+0+0+0}{\sqrt{3}\sqrt{3}} = \frac{1}{3}$
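
For completeness, here is a minimal Python sketch of the same computation (plain Python, no external libraries; the vectors `q` and `d` are just the ones from your question). Note that the zero components are summed along with everything else, they simply contribute nothing:

```python
import math

# Query "cat food beef" and document "Beef is delicious", in the
# 5-dimensional term space from the question (one axis per unique word).
q = [0, 1, 0, 1, 1]
d = [1, 1, 1, 0, 0]

def cosine_similarity(a, b):
    """Cosine similarity over the full vectors: every component,
    zero or not, enters both the dot product and the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(q, d))  # 0.333... = 1/3
```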

timleathart