0

I have customer transaction data over a period of time, where they have used multiple emails and phone number over the time. How I can associate all emails and phone numbers to one ID. My current dataframe is in following format

Name               Phone        Email
Ram                123456788    [email protected]
Ram Fernandes      123456788    [email protected]
Ram F              213456789    [email protected]
Ram Fern           213456789    [email protected]
Matthews           123456798    [email protected]

Output should look like

ID      Name
Cust 1  Ram
Cust 1  Ram Fernandes
Cust 1  Ram F
Cust 1  Ram Fern
Cust 2  Matthews

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Pulkit
  • 3
  • 2
  • What rules are you applying to define a single customer ID? – Joris C. Jun 30 '19 at 15:12
  • Hi! Please see https://stackoverflow.com/questions/42921674/assign-unique-id-based-on-two-columns. I believe it accomplishes what you need ^_^ – OctoCatKnows Jun 30 '19 at 15:36
  • @JorisChau In this case Record 1 and 2 have same phone different email, so both are same customer. Between Record 2 and 3, same email, so now 1,2 and 3 are same customer. Now 3 and 4 have number. so now all 1,2,3 and 4 are same customer. – Pulkit Jun 30 '19 at 16:51
  • @BuffsGrad16 its different as student value never changes value for same id in that example. it can vary in this example – Pulkit Jun 30 '19 at 16:53
  • 2
    I think your problem is, in essence, very similar to the one described [here](https://stackoverflow.com/q/56740990/5793905). Maybe you can adapt that graph-based approach. – Alexis Jun 30 '19 at 17:08
  • @Alexis Thanks a lot. Indeed its very similar. – Pulkit Jun 30 '19 at 20:28

1 Answers1

0

We can extract the first part of email before the word "@" to get the firstName and then use for loop to see if any of the previous entry has the same firstName or Phone. If there is match use that ID or increment the ID variable.

df$ID <- "Cust 1"
ind <- 1
df$firstName <- sub("(.*)@.*", "\\1", df$Email)

for (i in 2:nrow(df)) {
   if (with(df, is.na(Phone[i]) | Phone[i] == "" | Phone[i] == 0 | 
      is.na(firstName[i]) | firstName[i] == ""))
       df$ID[i] <- NA
   else {
   #check if current Phone or Firstname is similar to any of the previous entry
     inds <- with(df, Phone[i] == Phone[1:(i-1)] | 
                      firstName[i] == firstName[1:(i-1)])
     if (any(inds)) 
     #Get the ID of similar entry 
       df$ID[i] <- df$ID[which.max(inds)]
     else {
     #If there is no match then give a new ID
       ind = ind + 1
       df$ID[i] <- paste("Cust", ind)
     }
   }
}


df
#          Name     Phone         Email firstName     ID
#1          Ram 123456788 [email protected]       ram Cust 1
#2 RamFernandes 123456788   [email protected]        r1 Cust 1
#3         RamF 213456789   [email protected]        r1 Cust 1
#4      RamFern 213456789  [email protected]       ram Cust 1
#5     Matthews 123456798 [email protected]      Matt Cust 2
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • first part of the email doesnt necessary have to be same name, it could be [email protected] or [email protected] – Pulkit Jun 30 '19 at 13:21
  • @Pulkit then how do you plan to identify similar names? – Ronak Shah Jun 30 '19 at 13:22
  • I want to associate them like record 1 and 2, phone number is same but different email. Now for record 2 and 3 email is same and so on similarly for record 4. So all 1,2,3,4 record get same ID. I'm not concerned about getting the name correct, only give them an id – Pulkit Jun 30 '19 at 13:26
  • @Pulkit I see. I have updated the answer accordingly. Can you check if it works for your case now? – Ronak Shah Jul 01 '19 at 03:08
  • it works great unless I have blanks or 0 in email or phone number, I would like to remove them, probably another nested if else might help – Pulkit Jul 01 '19 at 18:16
  • @Pulkit okay..I added another check and replace the `ID`s with `NA` for those values. If not needed you can remove them later. – Ronak Shah Jul 02 '19 at 03:03