
I'd like to calculate the similarity between two sets using Jaccard but temper the results using the relative frequency of each item within a corpus.

Jaccard is defined as the magnitude of the intersection of the two sets divided by the magnitude of the union of them both.

$jaccard(A,B)=\frac{|A \cap B|}{|A \cup B|}$

If I use inverse document frequency (the log of the number of documents divided by the frequency of the item) ...

$idf(i)=\log\frac{|D|}{f(i)+1}$

$|D|$ is the number of documents

$f(i)$ is the frequency of item $i$ in the documents.

... can I define my weighted Jaccard similarity function as the sum of the IDFs of the items in the intersection divided by the sum of the IDFs of the items in the union? (Sorry, describing this in LaTeX reaches the limits of my knowledge of notation.) Will this scale the similarity appropriately? Or is a collection of weights better suited to cosine similarity?
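To make the proposal concrete, here is a minimal Python sketch of the measure described above: the sum of IDFs over the intersection divided by the sum of IDFs over the union. The function names, the `doc_freq` dictionary, and the corpus numbers are hypothetical and only for illustration. Returning 0.0 when the union has zero total weight is an assumed convention. Note that with the `+1` in the denominator, an item that appears in every document gets a slightly negative IDF, which this sketch does not guard against.

```python
import math


def idf(item, doc_freq, n_docs):
    """idf(i) = log(|D| / (f(i) + 1)), following the definition in the question."""
    return math.log(n_docs / (doc_freq.get(item, 0) + 1))


def idf_weighted_jaccard(a, b, doc_freq, n_docs):
    """Sum of IDFs over the intersection divided by sum of IDFs over the union."""
    a, b = set(a), set(b)
    union_weight = sum(idf(i, doc_freq, n_docs) for i in a | b)
    if union_weight == 0:
        # Assumed convention: no weight in the union means no similarity.
        return 0.0
    inter_weight = sum(idf(i, doc_freq, n_docs) for i in a & b)
    return inter_weight / union_weight


# Hypothetical corpus statistics: "the" is common, "cat" and "dog" are rare.
doc_freq = {"the": 900, "cat": 50, "dog": 40}
n_docs = 1000

a = {"the", "cat"}
b = {"the", "dog"}

plain = len(a & b) / len(a | b)  # unweighted Jaccard
weighted = idf_weighted_jaccard(a, b, doc_freq, n_docs)
```

In this example the two sets share only the very common item, so the weighted score comes out lower than the plain Jaccard score, which is the tempering effect I am after.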

1 Answer


The following paper, from experts at Google:

http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36928.pdf

defines weighted Jaccard between two vectors with non-negative entries as:

$J_W(x,y)=\frac{\sum_i \min(x_i,y_i)}{\sum_i \max(x_i,y_i)}$

where $x$ and $y$ are the two vectors.
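As a minimal sketch, assuming the usual min/max form of weighted Jaccard (the sum of element-wise minima divided by the sum of element-wise maxima), the computation could look like this in Python. The function name and the example vectors are hypothetical, and returning 0.0 for two all-zero vectors is an assumed convention.

```python
def weighted_jaccard(x, y):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("entries must be non-negative")
    numerator = sum(min(p, q) for p, q in zip(x, y))
    denominator = sum(max(p, q) for p, q in zip(x, y))
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical vectors over three items.
x = [1.0, 2.0, 0.0]
y = [1.0, 0.0, 3.0]
score = weighted_jaccard(x, y)  # minima sum to 1, maxima sum to 6

# With 0/1 indicator vectors this reduces to the ordinary Jaccard index.
binary_score = weighted_jaccard([1, 1, 0], [1, 0, 1])
```

If each set is encoded as a vector that holds an item's IDF when the item is present and 0 otherwise, and all IDFs are non-negative, the minima sum to the IDF total of the intersection and the maxima sum to the IDF total of the union. Under that encoding, the IDF ratio from the question is a special case of this definition.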

Using inverse document frequency is interesting, but I am not sure you could really call the result a weighted Jaccard.

Here is another paper on a related topic as well:

http://theory.stanford.edu/~sergei/papers/soda10-jaccard.pdf

Hope this helps!

Jake Drew