Need to generate Customer ID for transaction Data

Question

I have customer transaction data over a period of time, where they have used multiple emails and phone number over the time. How I can associate all emails and phone numbers to one ID. My current dataframe is in following format

Name               Phone        Email
Ram                123456788    [email protected]
Ram Fernandes      123456788    [email protected]
Ram F              213456789    [email protected]
Ram Fern           213456789    [email protected]
Matthews           123456798    [email protected]

Output should look like

ID      Name
Cust 1  Ram
Cust 1  Ram Fernandes
Cust 1  Ram F
Cust 1  Ram Fern
Cust 2  Matthews

Hi! Please see https://stackoverflow.com/questions/42921674/assign-unique-id-based-on-two-columns. I believe it accomplishes what you need ^_^ — OctoCatKnows, Jun 30 '19 at 15:36
@JorisChau In this case Record 1 and 2 have same phone different email, so both are same customer. Between Record 2 and 3, same email, so now 1,2 and 3 are same customer. Now 3 and 4 have number. so now all 1,2,3 and 4 are same customer. — Pulkit, Jun 30 '19 at 16:51
@BuffsGrad16 its different as student value never changes value for same id in that example. it can vary in this example — Pulkit, Jun 30 '19 at 16:53
I think your problem is, in essence, very similar to the one described [here](https://stackoverflow.com/q/56740990/5793905). Maybe you can adapt that graph-based approach. — Alexis, Jun 30 '19 at 17:08

Ronak Shah · Accepted Answer · 2019-07-02T03:02:49.807

0

We can extract the first part of email before the word "@" to get the firstName and then use for loop to see if any of the previous entry has the same firstName or Phone. If there is match use that ID or increment the ID variable.

df$ID <- "Cust 1"
ind <- 1
df$firstName <- sub("(.*)@.*", "\\1", df$Email)

for (i in 2:nrow(df)) {
   if (with(df, is.na(Phone[i]) | Phone[i] == "" | Phone[i] == 0 | 
      is.na(firstName[i]) | firstName[i] == ""))
       df$ID[i] <- NA
   else {
   #check if current Phone or Firstname is similar to any of the previous entry
     inds <- with(df, Phone[i] == Phone[1:(i-1)] | 
                      firstName[i] == firstName[1:(i-1)])
     if (any(inds)) 
     #Get the ID of similar entry 
       df$ID[i] <- df$ID[which.max(inds)]
     else {
     #If there is no match then give a new ID
       ind = ind + 1
       df$ID[i] <- paste("Cust", ind)
     }
   }
}


df
#          Name     Phone         Email firstName     ID
#1          Ram 123456788 [email protected]       ram Cust 1
#2 RamFernandes 123456788   [email protected]        r1 Cust 1
#3         RamF 213456789   [email protected]        r1 Cust 1
#4      RamFern 213456789  [email protected]       ram Cust 1
#5     Matthews 123456798 [email protected]      Matt Cust 2

edited Jul 02 '19 at 03:02

answered Jun 30 '19 at 13:17

Ronak Shah

377,200
20
156
213

first part of the email doesnt necessary have to be same name, it could be [email protected] or [email protected] – Pulkit Jun 30 '19 at 13:21
@Pulkit then how do you plan to identify similar names? – Ronak Shah Jun 30 '19 at 13:22
I want to associate them like record 1 and 2, phone number is same but different email. Now for record 2 and 3 email is same and so on similarly for record 4. So all 1,2,3,4 record get same ID. I'm not concerned about getting the name correct, only give them an id – Pulkit Jun 30 '19 at 13:26
@Pulkit I see. I have updated the answer accordingly. Can you check if it works for your case now? – Ronak Shah Jul 01 '19 at 03:08
it works great unless I have blanks or 0 in email or phone number, I would like to remove them, probably another nested if else might help – Pulkit Jul 01 '19 at 18:16
@Pulkit okay..I added another check and replace the `ID`s with `NA` for those values. If not needed you can remove them later. – Ronak Shah Jul 02 '19 at 03:03

Need to generate Customer ID for transaction Data

1 Answers1