1

I have a large dataframe, and I want to basically create a "unique identifier" for every separate person. The relevant column is the "e-mail" column, but it's made difficult by the formatting: each person can have multiple e-mails. Example frame below:

Name of person ||| E-mail Address
'John Doe'     ||| '[email protected]'
'Bob Jones'    ||| '[email protected];[email protected]'
'Robert Jones' ||| '[email protected];[email protected]'
'Clara Bit'    ||| '[email protected]'
'John Doe'     ||| '[email protected];[email protected]'

I want to have a field to tell people apart as individuals based on the e-mails:

Name of person ||| person ID
'John Doe'         1
'Bob Jones'        2
'Robert Jones'     2
'Clara Bit'        3
'John Doe'         4

My brain is kind of blowing up figuring out how to do it using for loops, so I'm hoping there's an easier way (plus, I'm iterating over df.index a lot, which I'm told is bad form and is incredibly slow regardless). Is there a function that could do something if I made multiple e-mail columns with single e-mail elements?

Thank you!

EDIT: Apologies for the typo on the third line of e-mails, it has been fixed.

Jim Eisenberg
  • 1,490
  • 1
  • 9
  • 17

1 Answers1

4

Assuming there is a typo on the shared email, this is a multiple steps problem that involves pandas and networkx libraries, this is a network problem, and I took inspiration from these 2 questions network problem and splitting list problem:

(1) Spit emails into lists
(2) Explode email column
(3) Create edge lists of users with same emails
(4) Create network with that edgelist
(5) Extract the different subgraphs of the network that will represent your unique id's
(6) Assign those unique id's to the original people

(1) Spit emails into lists

import pandas as pd  
df = pd.DataFrame({'name':['John','Bob', 'Rob', 'Clara', 'John'], 'email':['[email protected]','[email protected];[email protected]','[email protected];[email protected]','[email protected]','[email protected];[email protected]']}) 
df['email_list'] = df['email'].str.split(';').tolist()

(2) Explode email column

df_emails = df['email_list'].apply(pd.Series).reset_index().melt(id_vars='index',value_name='email').dropna()[['index', 'email']].set_index('index')

(3) Create edge lists of users with same emails

df_emails['email_id'] = df_emails.groupby('email').ngroup()
df_emails = df_emails.reset_index()
network = df_emails.merge(df_emails, on='email_id').drop(columns=['email_id', 'email_x', 'email_y'])

(4) Create network with that edgelist

import networkx as nx
G = nx.from_pandas_edgelist(network, source='index_x', target='index_y')

(5) Extract the different subgraphs of the network that will represent your unique id's

l = [list(x.nodes()) for x in nx.connected_component_subgraphs(G)]

(6) Assign those unique id's to the original people

d = dict((k, i) for i in range(len(l)) for k in l[i])
df['unique_id'] = df.index.map(d)

With the final result being:

    name    email   email_list  unique_id
0   John    [email protected] [[email protected]]   0
1   Bob [email protected];[email protected]   [[email protected], [email protected]]    1
2   Rob [email protected];[email protected]  [[email protected], [email protected]]   1
3   Clara   [email protected]  [[email protected]]    2
4   John    [email protected];[email protected] [[email protected], [email protected]]  3
Franco Piccolo
  • 6,845
  • 8
  • 34
  • 52