I think the answers you got are technically correct, but don't address the big picture.
In the data science world, cosine similarity is mainly used for documents that have been encoded by an embedding. Documents could be anything from a single sentence or a tweet to a paper with dozens of pages of text.
Embeddings include things like doc2vec, BERT, and similar, and they try to capture some level of semantic knowledge in their encoding. That is, words that tend to have similar meanings will end up close together in the high-dimensional embedding space. And documents with similar contexts will also end up close together in this space.
So, to answer your question:
Levenshtein distance has no knowledge of semantics; it's simply an edit distance and nothing more. In general, you only use it if you have no other choice. For example, compare:
3000 N Main Street
3000 N Maan Street
3001 N Main Street
9000 N Main Street
3000 S Main Street
All of the lines after the first are an edit distance of 1 from the first line. That is, one character is changed. As you can see, which character is changed (in this context) can make a HUGE difference, so using Levenshtein distance on the entire street address is useless.
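To make that concrete, here's a rough sketch with a hand-rolled Levenshtein implementation (libraries like python-Levenshtein or rapidfuzz do the same thing faster). Every variant comes back as distance 1 from the original, even though the differences matter very differently:

```python
# Classic dynamic-programming Levenshtein distance, written out for clarity.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

reference = "3000 N Main Street"
variants = ["3000 N Maan Street", "3001 N Main Street",
            "9000 N Main Street", "3000 S Main Street"]
for v in variants:
    print(levenshtein(reference, v), v)   # prints 1 for every variant
```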
So if "similar" means "would type nearly the same text" it might be useful, though it's hard to apply well and if you're looking for things like "fat finger" mistakes (typos) you might want to use a different edit distance that accounts for swaps of two adjacent characters.
Cosine similarity (where "similarity" is just 1 minus the "distance") is in general used on embeddings. The Bag of Words approach that the accepted answer uses for pedagogical purposes is clever, but I've never seen or heard of it before now. It does allow you to use Cosine distance as an approximation of edit distance, but it's really not used that way in practice.
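Just to illustrate what cosine on a bag of words looks like, here's a generic word-count sketch (not necessarily exactly what the accepted answer does, but the same general idea):

```python
from collections import Counter
import math

def bow_cosine(doc_a: str, doc_b: str) -> float:
    # Represent each document as word counts, then take the cosine of the
    # angle between the two count vectors.
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(bow_cosine("3000 N Main Street", "3000 S Main Street"))  # 0.75: three of four words shared
```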
In practice, you're using Cosine distance on an embedding that encodes semantic information: you're looking for words or documents (collections of words) that are using similar words in similar contexts. This would not be useful for the street address example, above, but you could imagine it being very useful on news articles or scientific papers, etc.
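As a sketch of what that looks like in practice, assuming you have the sentence-transformers package installed and don't mind downloading a small pretrained model like "all-MiniLM-L6-v2":

```python
# pip install sentence-transformers  (downloads the model on first use)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose sentence embedding model

docs = [
    "The central bank raised interest rates to slow inflation.",
    "Borrowing costs went up after the latest monetary policy decision.",
    "3000 N Main Street",
]

embeddings = model.encode(docs)                 # one dense vector per document
scores = util.cos_sim(embeddings, embeddings)   # pairwise cosine similarities
print(scores)
# The two sentences about interest rates score much closer to each other
# than either does to the street address, even though they share few words.
```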
You could use Euclidean distance in the embedding space, comparing the vector for each document directly, but there can be issues with magnitude. And cosine similarity measures only the relative directions of the documents, not their magnitude, which is in general more useful and more like what you expect when you want to compare two documents in terms of their "topic" or "meaning", etc.
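Here's a tiny numeric illustration of the magnitude point: two vectors pointing in exactly the same direction can be far apart in Euclidean terms while their cosine similarity is still a perfect 1.

```python
import numpy as np

# Two "documents" pointing in the same direction in embedding space;
# one simply has a much larger magnitude (think: a much longer document).
a = np.array([1.0, 2.0, 3.0])
b = 10 * a

euclidean = np.linalg.norm(a - b)                          # about 33.7: "far apart"
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # exactly 1.0: same direction
print(euclidean, cosine)
```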
So if "similar" means "talking about something similar or in a similar way" than you'll probably end up using a Cosine similarity measure with an embedding.
The accepted answer is technically creating an embedding, but in general I think the term "embedding" in data science refers to something like doc2vec, BERT, GloVe, etc., which reflect co-occurrences and other factors from which a semantic-like quality emerges.