I am trying to understand the difference between Jaccard and cosine similarity. However, there seems to be a disagreement among the answers to "Applications and differences for Jaccard similarity and Cosine Similarity".

Could anyone step me through the calculation of the Jaccard similarity for this cosine similarity example from https://bioinformatics.oxfordjournals.org/content/suppl/2009/10/24/btp613.DC1/bioinf-2008-1835-File004.pdf

Given:

[Image: the term vectors t1 and t2 from the linked example]

Question: How do we compute the Jaccard Similarity index between t1 and t2?

Thank you.

jkyh

1 Answer

Cosine similarity compares two real-valued vectors, whereas the standard Jaccard similarity compares two sets (equivalently, two binary vectors). So you cannot compute the standard Jaccard similarity index directly on your two count vectors, but there is a generalized version of the Jaccard index for non-negative real-valued vectors which you can use in this case:

$J_g(\mathbf{a}, \mathbf{b}) = \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)}$

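So for your example, with $t_1 = (1, 1, 0, 1)$ and $t_2 = (2, 0, 1, 1)$, the generalized Jaccard similarity index can be computed as follows:

$J_g(t_1, t_2) = \frac{\min(1,2)+\min(1,0)+\min(0,1)+\min(1,1)}{\max(1,2)+\max(1,0)+\max(0,1)+\max(1,1)} = \frac{1+0+0+1}{2+1+1+1} = \frac{2}{5} = 0.4$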
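To check the arithmetic, here is a minimal Python sketch of the min/max formula. The function name `generalized_jaccard` is just an illustrative choice, and returning 1.0 for two all-zero vectors is an assumption on my part, since the formula itself is undefined there.

```python
def generalized_jaccard(a, b):
    """Sum of element-wise minima divided by sum of element-wise maxima.

    Hypothetical helper for illustration; assumes non-negative entries.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Assumption: two all-zero vectors are treated as identical (0/0 case).
    return 1.0 if denominator == 0 else numerator / denominator


t1 = (1, 1, 0, 1)
t2 = (2, 0, 1, 1)
print(generalized_jaccard(t1, t2))  # 0.4
```

Note that on 0/1 vectors the minima count the shared positions and the maxima count the positions present in either vector, so this reduces to the ordinary set-based Jaccard index.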

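Alternatively, you can treat your bag-of-words vectors as binary vectors, where a value of $1$ indicates a word's presence and $0$ indicates a word's absence, i.e. $t_1 = (1, 1, 0, 1), t_2 = (1, 0, 1, 1)$. From there, you can compute the original Jaccard similarity index as the number of words present in both, divided by the number of words present in at least one. Here two words appear in both vectors, one appears only in $t_1$, and one appears only in $t_2$:

$J(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{2}{2+1+1} = 0.5$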
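The binarized version can be sketched in Python with sets of word positions. As before, `binary_jaccard` is a hypothetical name for illustration, and treating two empty sets as identical is an assumption rather than part of the definition.

```python
def binary_jaccard(a, b):
    """Jaccard index of the sets of positions where each vector is non-zero.

    Hypothetical helper for illustration; any count > 0 means "word present".
    """
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    # Assumption: two empty sets are treated as identical (0/0 case).
    return 1.0 if not union else len(present_a & present_b) / len(union)


t1 = (1, 1, 0, 1)
t2 = (2, 0, 1, 1)  # the count 2 is binarized to "present"
print(binary_jaccard(t1, t2))  # 0.5
```

Binarizing discards the word counts, which is why this gives 0.5 rather than the 0.4 obtained from the generalized index on the raw counts.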

timleathart
    @jkyh please see my edit to this answer -- I misinterpreted your situation initially and have added extra clarification – timleathart Dec 22 '16 at 02:02