How to merge two csv files, remove duplicates and create a new column in R?

Question

This is a very simple question, but I am unable to find my answer in other posts. For example: - Merging two dataframes, removing duplicates and aggregation in R - Merge two dataframes with repeated columns

Both csv files have the same column names (e.g., name, email, status, etc.). The first csv is a master list of names and emails. The second is a list of individuals who have RSVP'd to an event. I want to merge the two data frames, remove any duplicates, and then mutate to create a new column for RSVP == "yes/no". I imagine you use one of dplyr's functions in this problem, but unsure if it would be a full_join() or inner_join(). To give an example:

  status          names           email   company
1 invited     John Smith  [email protected] Company A
2 invited   Abbi Maureen  [email protected] Company B
3 invited       Sara Doe  [email protected] Company C
4 invited Maria Gonzalez [email protected] Company D
5 invited  Frank Russell [email protected] Company E

The second csv is a list of individuals who confirmed their attendance, their status marked with RSVP.

 status        names          email   company
1   RSVP Abbi Maureen [email protected] company B
2   RSVP   John Smith [email protected] Company A

I am stuck in determining how to best merge these two data frames, remove any duplicates, and then create a new column (i.e., RSVP yes no)? Would it be full_join and then mutate?

You can use `left_join` if the intention is to create a column in the first dataset — akrun, Nov 25 '19 at 18:54
`df1$rsvp <- df1$name %in% df2$name` I don't think you need to load complex foreign libraries for this. — cory, Nov 25 '19 at 18:59
It's not clear to me what the end result should be. What duplicates should be removed, and what dataset needs a new column? — Axeman, Nov 25 '19 at 19:03
@cory you could have two people with the same name. Better to go by email, which has to be unique. — iod, Nov 25 '19 at 19:06
The original intent was to merge the two data frames. Doing so would create duplicates of names (i.e., a person in the master list and the same who RSVP'd). I just tried and I'm not sure `left_join` solved that problem. — bjk127, Nov 25 '19 at 19:19
change name of col 'status' in second csv to 'rsvp' or 'response and then do 'left_join'. you can then filter on new col_name == "RSVP" to see who's coming — infominer, Nov 25 '19 at 19:31
I agree with cory and iod, you don't need a full left join, just create the column RSVP in your first dataframe using the following logic: df1$rsvp <- df1$email%in% df2$email — Shirin Yavari, Nov 25 '19 at 19:36

Joe Erinjeri · Answer 1 · 2019-11-27T01:13:13.040

So the way I think about this is that the invitations dataframe (lets call this df1) has most of the data you want, and you want to add specific new unique information from the RSVPs dataframe (lets call this df2). You need to identify a key to form a crosswalk between df1 and df2. In this case, the email is the unique key. Now, putting the solution together...

First get the key and new info you want to add

df2<-df2 %>%
  select(email, status)

Then do the left join, and mutate to create the new column with the RSVP

merged_df<-df1 %>%
  left_join(df2, by="email") %>%
  mutate(RSVP=if_else(status=="RSVP","yes","no"))

or you can do it all in one step with a right join

merge_df<-df2 %>%
  select(email, status) %>%
  right_join(df1, by="email") %>%
  mutate(RSVP=if_else(status=="RSVP","yes","no"))

Note that you can do both in one assignment with a `right_join()` — Dan Chaltiel, Nov 26 '19 at 08:40

How to merge two csv files, remove duplicates and create a new column in R?

1 Answers1