This is an entity resolution aka record linkage aka data matching problem.
I would solve this by removing all of the non-alphabetical characters including numbers, casting into all uppercase and then employing a hierarchical match. First match up the exact cases and then move to a Levenshtein scoring between the fields. Make some sort of a decision about how large you will allow the Levenshtein or normalized Levenshtein score to get before you declare something a non-match.
Assign every row an id and when you have a match, reassign the lower of the IDs to both members of the match.
The Levenshtein distance algorithm is simple but brilliant (taken from here):
def levenshtein(a,b):
"Calculates the Levenshtein distance between a and b."
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
This Data Matching book is a good resource and is free for seven days on Amazon.
Nominally, this is an $n^2$ algorithm without exploiting some sorting efficiencies, so I would expect to have to use multiple cores on $2\times10^7$ rows. But this should run just fine on an 8 core AWS instance. It will eventually finish on a single core, but might take several hours.
Hope this helps!