What's an efficient way to compare and group millions of store names?

Question

I'm a total amateur as far as data science goes, and I'm trying to figure out a way to do some string comparison on a large dataset.

I have a Google BigQuery table storing merchant transactions, but the store names are all over the place. For example, there can be 'Wal-Mart Super Center' and 'Wal-Mart SC #1234', or 'McDonalds F2222' and 'McDonalds #321'.

What I need to do is group ALL 'Wal-mart' and 'McDonalds' and whatever else. My first approach was a recursive regex check, but that took forever and eventually timed out.

What's the best approach for doing that with a table of 20 million+ rows? I'm open to trying out any technology that would fit this job.

AN6U5 · Accepted Answer · 2015-08-20T22:26:07.967

This is an entity resolution (aka record linkage, aka data matching) problem.

I would solve this by removing all of the non-alphabetical characters including numbers, casting into all uppercase and then employing a hierarchical match. First match up the exact cases and then move to a Levenshtein scoring between the fields. Make some sort of a decision about how large you will allow the Levenshtein or normalized Levenshtein score to get before you declare something a non-match.

Assign every row an id and when you have a match, reassign the lower of the IDs to both members of the match.

The Levenshtein distance algorithm is simple but brilliant (taken from here):

def levenshtein(a, b):
    """Calculate the Levenshtein distance between a and b."""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space
        a, b = b, a
        n, m = m, n

    # Distances from the empty prefix of b to every prefix of a
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)

    return current[n]
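
The steps described above (normalize, exact match, then Levenshtein, with the lower id winning) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `normalize`, `group_names` and `max_ratio` are made up for this example, the score is normalized by the length of the longer string (other normalizations exist), and the 0.16 default is simply the cutoff mentioned in the comments, not a recommendation. The Levenshtein function is repeated so the block runs on its own.

```python
import re


def normalize(name):
    """Drop every non-letter (digits, spaces, punctuation) and uppercase the rest."""
    return re.sub(r"[^A-Za-z]", "", name).upper()


def levenshtein(a, b):
    """Two-row Levenshtein distance (same algorithm as above)."""
    if len(a) > len(b):
        a, b = b, a
    current = list(range(len(a) + 1))
    for i in range(1, len(b) + 1):
        previous, current = current, [i] + [0] * len(a)
        for j in range(1, len(a) + 1):
            change = previous[j - 1] + (a[j - 1] != b[i - 1])
            current[j] = min(previous[j] + 1, current[j - 1] + 1, change)
    return current[len(a)]


def find(parent, i):
    """Return the representative id of i's group (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, i, j):
    """Merge two groups; the lower id stays the representative."""
    ri, rj = find(parent, i), find(parent, j)
    if ri != rj:
        parent[max(ri, rj)] = min(ri, rj)


def group_names(names, max_ratio=0.16):
    """Return one group id per row; matched rows share the lowest row id."""
    parent = list(range(len(names)))

    # Pass 1: exact matches on the normalized name.
    first_seen = {}
    for idx, name in enumerate(names):
        key = normalize(name)
        if key in first_seen:
            union(parent, idx, first_seen[key])
        else:
            first_seen[key] = idx

    # Pass 2: fuzzy matches between distinct normalized names.
    # Assumption: normalize the distance by the longer string's length.
    keys = list(first_seen)
    for x in range(len(keys)):
        for y in range(x + 1, len(keys)):
            a, b = keys[x], keys[y]
            longest = max(len(a), len(b))
            if longest and levenshtein(a, b) / longest <= max_ratio:
                union(parent, first_seen[a], first_seen[b])

    return [find(parent, i) for i in range(len(names))]


names = [
    "Wal-Mart Super Center",
    "WAL MART SUPERCENTER",
    "McDonalds #321",
    "McDonalds F2222",
    "Walmart Super Centre",
]
print(group_names(names))
```

The second pass is still quadratic, but only in the number of distinct normalized names rather than in the number of rows. Note that with this normalization an abbreviation such as 'Wal-Mart SC #1234' is nowhere near a 0.16 cutoff against 'Wal-Mart Super Center', so abbreviations would need separate handling.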

This Data Matching book is a good resource and is free for seven days on Amazon.

Nominally, this is an $n^2$ algorithm without exploiting some sorting efficiencies, so I would expect to have to use multiple cores on $2\times10^7$ rows. But this should run just fine on an 8 core AWS instance. It will eventually finish on a single core, but might take several hours.
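
One common way to cut down the $n^2$ comparisons in record linkage is "blocking": only compare names that share some cheap key. The sketch below uses the first few characters of the normalized name as that key. The function name `candidate_pairs` and the `prefix_len` parameter are hypothetical, and the choice of a prefix is an assumption; it will miss pairs whose typo falls inside the prefix.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(keys, prefix_len=3):
    """Yield pairs of distinct keys that share their first prefix_len characters."""
    blocks = defaultdict(list)
    for key in set(keys):
        blocks[key[:prefix_len]].append(key)
    for block in blocks.values():
        # Only names inside the same block are ever compared.
        yield from combinations(sorted(block), 2)


keys = [
    "WALMARTSC",
    "WALMARTSUPERCENTER",
    "WALGREENS",
    "MCDONALDS",
    "MCDONALDSF",
    "GOOGLEMARKETPLACE",
]
pairs = set(candidate_pairs(keys))
all_pairs = len(keys) * (len(keys) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)
```

Each candidate pair would then be scored with Levenshtein as before; the saving comes from never scoring pairs in different blocks.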

Hope this helps!

edited Aug 20 '15 at 22:26

answered Aug 20 '15 at 22:19


Thanks! Levenshtein is causing some issues because there are merchants like 'Google Marketplace' and 'Goodwill Marketplace' which have a lower score than even 'Wal-mart' and 'walgreens'. I think I can probably whitelist the outliers though. And double thanks for the book recommendation. – TerryMatula Aug 21 '15 at 15:57
That's a tough one, as the normalized Levenshtein is less than $0.16$, which is a cutoff I often use. I guess you could also try splitting the names into words on spaces and then comparing all words to all words in some smart way. This is a tough problem if you are going for high confidence! Another option is to see where you get with exact match and then use Mechanical Turk (or yourself) for the last couple million. – AN6U5 Aug 21 '15 at 16:07
Based on the examples it seems like it should be front-weighted in some way, as names that should be considered equivalent differ near the end but never in the beginning. – Adam Bittlingmayer Oct 18 '15 at 19:13
For example I would not remove a number from the beginning. Or at least look to see all the data with numbers in the beginning. – Adam Bittlingmayer Oct 18 '15 at 19:18
@Adam M. B., Levenshtein is a "base algorithm" and there are many derivations of it that have been tailored for specific advantages and cases. I haven't found prioritizing beginnings of words over their ends to be effective because my data often contains human error and typos can occur anywhere, but some datasets may benefit. There are a number of modifications to Levenshtein in the book that I referenced. I personally use an algo that only allows for an edit distance error of 1 or less. A key advantage being that it requires a high level of match and scales like $n$ rather than $n^2$. – AN6U5 Oct 19 '15 at 16:46
All good points. It's just a suggestion but without seeing the data it is hard to draw specific conclusions. The thing about frontweighting Levenshtein (as opposed to other optimisations) is that it adds some of the benefit of the other answer (alphasort), which I think has merit. – Adam Bittlingmayer Oct 19 '15 at 17:09
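
The word-splitting and front-weighting ideas from these comments can be illustrated in their crudest form: group on the first word only. This is a minimal sketch, not anyone's actual method; `first_token` and `group_by_first_token` are hypothetical names, and keeping digits is an assumption that follows the suggestion not to strip numbers at the start of a name.

```python
import re
from collections import defaultdict


def first_token(name):
    """Uppercase the name and return its first word, stripped of punctuation."""
    words = [re.sub(r"[^A-Z0-9]", "", w) for w in name.upper().split()]
    words = [w for w in words if w]
    return words[0] if words else ""


def group_by_first_token(names):
    """Map each first token to the list of original names that start with it."""
    groups = defaultdict(list)
    for name in names:
        groups[first_token(name)].append(name)
    return dict(groups)


names = [
    "Wal-Mart Super Center",
    "Wal-Mart SC #1234",
    "McDonalds F2222",
    "McDonalds #321",
    "Google Marketplace",
    "Goodwill Marketplace",
]
for token, members in group_by_first_token(names).items():
    print(token, members)
```

On these examples the two Marketplace merchants stay apart while the Wal-Mart and McDonalds variants come together, but a name written as 'Wal Mart' would land under 'WAL', so this is a starting point rather than a full answer.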

image_doctor · Answer 2 · 2015-08-21T11:17:20.473

I'd be really tempted to be lazy and apply some old technology for a quick and dirty solution, with no programming, using the Linux sort command. This will give you a lexicographically sorted list.

If the store names are not the first field, just reorder the fields or tell sort to use a different field via the -k switch.

Save the data to a plain CSV text file and then sort them:

$ sort myStores.csv > sortedByStore.csv

You can give sort a hand by allocating it plenty of memory, 16GB in this case:

$ sort -S16G myStores.csv > sortedByStore.csv

You could go further and produce a list of unique store names and counts of instances for them to help you get a handle on what the data looks like:

$ sort -S16G myStores.csv | cut -f1 -d, | uniq -c > storeIdsAndCounts.csv

Or, to avoid re-sorting and get only the unique IDs from the already sorted file:

$ cat sortedByStore.csv | cut -f1 -d, | uniq > storeIds.csv
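
If a little programming is acceptable after all, roughly the same per-name count can be sketched in Python. This is an illustrative equivalent, not part of the sort-based approach: `count_first_field` is a hypothetical helper, the sample rows are made up, and the store name is assumed to be the first CSV field. One difference is that the csv module also copes with quoted fields that contain commas, which cut -d, does not.

```python
import csv
import io
from collections import Counter


def count_first_field(lines):
    """Count how often each value of the first CSV field occurs."""
    reader = csv.reader(lines)
    return Counter(row[0] for row in reader if row)


# Made-up sample standing in for myStores.csv; pass an open file object instead.
sample = io.StringIO(
    "Wal-Mart SC #1234,12.50\n"
    "McDonalds #321,4.99\n"
    "Wal-Mart SC #1234,80.00\n"
)
counts = count_first_field(sample)
for name, n in sorted(counts.items()):
    print(n, name)
```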

This might not be what the OP is looking for, but +1 for laziness! – eliasah Oct 18 '15 at 16:41
