Calculating similarity where order matters

Question

How can I calculate a similarity (coefficient) where the order of the items matters, so that something like the Jaccard index would not be useful?

Specifically, I'm interested in comparing ingredients. Take a simplified apple pie ingredient list, for example:

Apples
Enriched Unbleached Flour
Palm Oil
High Fructose Corn Syrup
Salt
Eggs
Spices

And then compare that to:

1. The exact same ingredients, but in reverse order, where spices are the top ingredient and apples are the least used ingredient
2. The same list, but with palm oil replaced with vegetable oil and corn syrup replaced with sugar

Even if you're naive about the fact that the replacements made in #2 are similar, the set of #2 is much more similar to the original than the set of #1. Is there an algorithm which would express that?

Erwan · Accepted Answer · 2019-08-03T00:43:31.840


Late answer for an interesting question:

How can I calculate a similarity (coefficient) where the order of the items matters

This is exactly what character-based approximate string matching measures do, since a string is an ordered list of characters. So the idea is to consider every element in the list as a character in a string and apply the algorithm. The main character-based measures are:

The Levenshtein edit distance, for which there are many available variations
Jaro-Winkler

I would recommend the former since it has a clearer interpretation and is probably more generally used.
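As an illustration, here is a minimal Python sketch of the Levenshtein distance applied to lists of ingredients instead of characters. The function names `levenshtein` and `similarity` are made up for this example, and dividing by the length of the longer list to get a score between 0 and 1 is an assumption, not part of the distance itself.

```python
def levenshtein(a, b):
    """Minimum number of insertions, removals and substitutions turning list a into list b."""
    # previous[j] is the distance between the items of a seen so far and the first j items of b.
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning i items of a into an empty list takes i removals
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # remove item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


original = ["Apples", "Enriched Unbleached Flour", "Palm Oil",
            "High Fructose Corn Syrup", "Salt", "Eggs", "Spices"]
reversed_list = original[::-1]
substituted = ["Apples", "Enriched Unbleached Flour", "Vegetable Oil",
               "Sugar", "Salt", "Eggs", "Spices"]

print(levenshtein(original, reversed_list))  # 6
print(levenshtein(original, substituted))    # 2
```

With this normalization the substituted list scores about 0.71 and the reversed list about 0.14, which matches the intuition in the question that the list with two replacements is much closer to the original.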

edited Aug 03 '19 at 00:43

answered Aug 03 '19 at 00:24


Great! Looks like the algorithms are generally designed around letters rather than multi-word ingredient phrases, but I can re-build something like the iterative approach example (https://www.python-course.eu/levenshtein_distance.php) to do phrase comparisons. – James S Aug 05 '19 at 15:41
Yes that's what I meant, you can replace "character" with any element in an ordered list. Levenshtein distance is quite intuitive since its value represents the minimum number of edits (insertion, removal, substitution) needed to transform one ordered list into the other. The algorithm is also a nice example of dynamic programming. Enjoy :) – Erwan Aug 05 '19 at 16:20

glhuilli · Answer 2 · 2019-03-05T23:01:16.383

Not sure if this exists already, but maybe you could build a new similarity metric (let's call it $js$) with a few requirements:

If lists $A$ and $B$ are exactly the same, then $js(A, B) = 1.0$
If lists $A$ and $B$ are completely different (no items in common), then $js(A, B) = 0.0$
If all elements match but are shuffled, then $js(A, B) = 0.5$
If $|A| \neq |B|$, then the similarity metric should be computed over the items of the longest list (e.g. if there are a few extra ingredients in $A$, but $B$ is identical otherwise, there will still be some penalty)

Then you can build a method that basically does the following (e.g. in Python):

def js(A, B):
    # Score against the longer list so that extra items are penalized.
    longest_list, shortest_list = (A, B) if len(A) > len(B) else (B, A)
    cumulative_score = 0
    # zip stops at the end of the shorter list, so the extra items score 0.
    for long_item, short_item in zip(longest_list, shortest_list):
        if long_item == short_item:
            cumulative_score += 1      # same item at the same position
        elif long_item in shortest_list:
            cumulative_score += 0.5    # item present, but at another position
    return cumulative_score / len(longest_list)

Then you would have something like this:

In [10]: print(l)
['Apples', 'Enriched Unbleached Flour', 'Palm Oil', 'High Fructose Corn Syrup', 'Salt', 'Eggs', 'Spices']

In [11]: print(l1)
['Apples', 'Enriched Unbleached Flour', 'vegetable oil', 'sugar', 'Salt', 'Eggs', 'Spices']

In [12]: print(l2)
['Spices', 'Eggs', 'Salt', 'High Fructose Corn Syrup', 'Palm Oil', 'Enriched Unbleached Flour', 'Apples']

In [13]: print(l3)
['Enriched Unbleached Flour', 'Palm Oil', 'High Fructose Corn Syrup', 'Salt', 'Eggs', 'Spices', 'Apples']

In [14]: print(l4)
['Apples', 'Enriched Unbleached Flour', 'Palm Oil', 'High Fructose Corn Syrup', 'Salt', 'Eggs', 'Spices', 'Ice Cream']

In [15]: js(l, l1)
Out[15]: 0.7142857142857143

In [16]: js(l, l2)
Out[16]: 0.5714285714285714

In [17]: js(l, l3)
Out[17]: 0.5

In [18]: js(l, l4)
Out[18]: 0.875

In [19]: js(l, ['Nutella'])
Out[19]: 0.0
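
For reference, the scores above can be reproduced by hand. Writing $L$ for the longer list and $S$ for the shorter one (symbols introduced here only for illustration), the function computes roughly:

$$js(A, B) = \frac{1}{|L|} \sum_{i=1}^{|S|} s_i, \qquad s_i = \begin{cases} 1 & \text{if } L_i = S_i \\ 0.5 & \text{if } L_i \neq S_i \text{ and } L_i \in S \\ 0 & \text{otherwise} \end{cases}$$

For the reversed list the middle item stays in place, so $js(l, l2) = (1 + 6 \times 0.5) / 7 = 4/7 \approx 0.571$, slightly above the $0.5$ target for a shuffle. For the rotated list no item keeps its position, so $js(l, l3) = (7 \times 0.5) / 7 = 0.5$. With one extra ingredient, $js(l, l4) = 7 / 8 = 0.875$.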
