In fact your problem is finding the connected components of a graph. In your case the vertices of the graph are persons. Based on attribute information, i.e. e-mail and phone number, you can establish relationships between them, which are the edges.
It looks like simple methods like paste, duplicated or group_by are not effective, as you can have rather complicated paths. As you explained, although person D and person E have completely different contacts, they are in fact connected through person C and hence should have the same ID.
Or, in other words: some person registered on the site with e-mail A and mobile B. Then he lost the phone and registered with mobile C. Then he forgot his password and registered with e-mail D. In the end we have a person with e-mail D and mobile C. For some unknown reason he registered under different names.
You may have even more complicated relationship pathways.
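To make the chain above concrete, here is a minimal sketch. The registration names (reg1, reg2, reg3, other) and the contact values are made up for illustration only. It shows that the first and the last registration land in the same component even though they share neither e-mail nor phone:

```r
library(igraph)

# Hypothetical registrations: three by the same person, one unrelated
regs <- data.frame(
  name  = c("reg1", "reg2", "reg3", "other"),
  email = c("A", "A", "D", "E"),
  tel   = c("B", "C", "C", "F"),
  stringsAsFactors = FALSE
)

# All unordered pairs of rows; a pair is linked if e-mail or phone matches
pairs  <- t(combn(nrow(regs), 2))
linked <- regs$email[pairs[, 1]] == regs$email[pairs[, 2]] |
          regs$tel[pairs[, 1]]   == regs$tel[pairs[, 2]]
edges  <- data.frame(from = regs$name[pairs[linked, 1]],
                     to   = regs$name[pairs[linked, 2]])

# Passing `vertices` keeps isolated registrations in the graph as well
g <- graph_from_data_frame(edges, directed = FALSE, vertices = regs)
membership <- components(g)$membership
membership
```

reg1 and reg3 have no direct edge, but both are linked to reg2, so all three get the same component number, while the unrelated registration gets its own.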
The algorithm below uses igraph to make an undirected graph based on an adjacency matrix built from your condition. It then identifies the connected components, extracts them and merges them with the initial data.frame. As there was not enough data in your example, simulated data was used.
Simulated Input:
name tel email
1 AAA 222 [email protected]
2 BBB 555 [email protected]
3 CCC 333 [email protected]
4 DDD 666 [email protected]
5 EEE 666 [email protected]
6 FFF 111 [email protected]
7 GGG 444 [email protected]
8 HHH 666 [email protected]
9 III 444 [email protected]
10 JJJ 333 [email protected]
Code:
library(igraph)
set.seed(123)
n <- 10
# simulation
df <- data.frame(
name = sapply(1:n, function(i) paste0(rep(LETTERS[i], 3), collapse = "")),
tel = sample(1:6, n, replace = TRUE) * 111,
email = paste0(sample(LETTERS[1:6], n, replace = TRUE), "@xy.com")
)
# adjacency matrix preparation
df1 <- expand.grid(df$name, df$name)
names(df1) <- c("name_x", "name_y")
df1 <- merge(df1, df, by.x = "name_x", by.y = "name")
df1 <- merge(df1, df, by.x = "name_y", by.y = "name")
# two persons are connected if they share a phone number or an e-mail
df1$con <- ifelse(with(df1, tel.x == tel.y | email.x == email.y), 1, 0)
library(reshape2)
# reshape the long pair list into a wide name-by-name table of 0/1 flags
m <- dcast(df1[, c("name_y", "name_x", "con")], name_y ~ name_x, value.var = "con")
rownames(m) <- m[, 1]
m[, 1] <- NULL
m <- as.matrix(m)
diag(m) <- 0
# graph creation
g1 <- graph_from_adjacency_matrix(m, mode = "undirected")
# named vector: vertex name -> number of the component it belongs to
memb <- components(g1)$membership
# groups extraction: one row per person with the id of its component
df_ids <- data.frame(id = unname(memb), name = names(memb))
# data merging
result <- merge(df, df_ids)
result
Output:
name tel email
1 AAA 222 [email protected]
2 BBB 555 [email protected]
3 CCC 333 [email protected]
4 DDD 666 [email protected]
5 EEE 666 [email protected]
6 FFF 111 [email protected]
7 GGG 444 [email protected]
8 HHH 666 [email protected]
9 III 444 [email protected]
10 JJJ 333 [email protected]
Relationship Graph (only first letters of name were taken)
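For larger data the n x n adjacency matrix can become expensive. A possible alternative, sketched here and not part of the algorithm above, is to link every person to the contact values it uses and take the components of that graph directly. The sample data and the "name:", "tel:" and "email:" prefixes are assumptions made for this illustration, and, like the code above, the sketch assumes that names are unique:

```r
library(igraph)

# Hypothetical sample data (placeholders, not the simulated input above)
df <- data.frame(
  name  = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  tel   = c(111, 222, 222, 333, 444),
  email = c("a@xy.com", "a@xy.com", "b@xy.com", "c@xy.com", "c@xy.com"),
  stringsAsFactors = FALSE
)

# One edge per (person, contact value). The prefixes keep the three kinds
# of vertex apart, so a name can never collide with a phone or an e-mail.
edges <- rbind(
  data.frame(from = paste0("name:", df$name), to = paste0("tel:", df$tel)),
  data.frame(from = paste0("name:", df$name), to = paste0("email:", df$email))
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Persons that reach each other through shared contacts share a component
memb  <- components(g)$membership
df$id <- unname(memb[paste0("name:", df$name)])
df
```

Here AAA and BBB share an e-mail and BBB and CCC share a phone, so the three get one id, while DDD and EEE (same e-mail) get another. The number of edges grows with the number of rows rather than with its square.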
