A Difficult Deduplication

Question

I have a large dataframe, and I want to basically create a "unique identifier" for every separate person. The relevant column is the "e-mail" column, but it's made difficult by the formatting: each person can have multiple e-mails. Example frame below:

Name of person ||| E-mail Address
'John Doe'     ||| '[email protected]'
'Bob Jones'    ||| '[email protected];[email protected]'
'Robert Jones' ||| '[email protected];[email protected]'
'Clara Bit'    ||| '[email protected]'
'John Doe'     ||| '[email protected];[email protected]'

I want to have a field to tell people apart as individuals based on the e-mails:

Name of person ||| person ID
'John Doe'         1
'Bob Jones'        2
'Robert Jones'     2
'Clara Bit'        3
'John Doe'         4

My brain is kind of blowing up figuring out how to do it using for loops, so I'm hoping there's an easier way (plus, I'm iterating over df.index a lot, which I'm told is bad form and is incredibly slow regardless). Is there a function that could do something if I made multiple e-mail columns with single e-mail elements?

Thank you!

EDIT: Apologies for the typo on the third line of e-mails, it has been fixed.

with your current example, Bob and Robert Jones _don't_ share an email, is one of Robert's supposed to be `[email protected]`? — wpercy, Oct 25 '18 at 16:12
It's unclear how you grouped `Bob Jones` with `Robert Jones`, none of the field in their `email` column matched. — Rocky Li, Oct 25 '18 at 16:13
presumably the lack of 'j' in the latter aboy.net email is a typo — TravisThomas, Oct 25 '18 at 16:17
I would use a dict that maps email addresses to the UID. Checking if the email is among the keys of the dict is fast and trivial. — TravisThomas, Oct 25 '18 at 16:51

Franco Piccolo · Accepted Answer · 2018-10-26T05:38:07.677

Assuming there is a typo on the shared email, this is a multiple steps problem that involves pandas and networkx libraries, this is a network problem, and I took inspiration from these 2 questions network problem and splitting list problem:

(1) Spit emails into lists
(2) Explode email column
(3) Create edge lists of users with same emails
(4) Create network with that edgelist
(5) Extract the different subgraphs of the network that will represent your unique id's
(6) Assign those unique id's to the original people

(1) Spit emails into lists

import pandas as pd  
df = pd.DataFrame({'name':['John','Bob', 'Rob', 'Clara', 'John'], 'email':['[email protected]','[email protected];[email protected]','[email protected];[email protected]','[email protected]','[email protected];[email protected]']}) 
df['email_list'] = df['email'].str.split(';').tolist()

(2) Explode email column

df_emails = df['email_list'].apply(pd.Series).reset_index().melt(id_vars='index',value_name='email').dropna()[['index', 'email']].set_index('index')

(3) Create edge lists of users with same emails

df_emails['email_id'] = df_emails.groupby('email').ngroup()
df_emails = df_emails.reset_index()
network = df_emails.merge(df_emails, on='email_id').drop(columns=['email_id', 'email_x', 'email_y'])

(4) Create network with that edgelist

import networkx as nx
G = nx.from_pandas_edgelist(network, source='index_x', target='index_y')

(5) Extract the different subgraphs of the network that will represent your unique id's

l = [list(x.nodes()) for x in nx.connected_component_subgraphs(G)]

(6) Assign those unique id's to the original people

d = dict((k, i) for i in range(len(l)) for k in l[i])
df['unique_id'] = df.index.map(d)

With the final result being:

    name    email   email_list  unique_id
0   John    [email protected] [[email protected]]   0
1   Bob [email protected];[email protected]   [[email protected], [email protected]]    1
2   Rob [email protected];[email protected]  [[email protected], [email protected]]   1
3   Clara   [email protected]  [[email protected]]    2
4   John    [email protected];[email protected] [[email protected], [email protected]]  3

Amazing answer, thank you so much! Tested that it works, now I just need to sit down and figure out why :P — Jim Eisenberg, Oct 26 '18 at 08:40

A Difficult Deduplication

1 Answers1