How do I compare columns in different data frames?

Question

I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.

Could you please indicate how you want the result to look like? Is it a df with names appearing in both dfs, and whether you also need anything else such as count, or matching column in df2 ,etc. Thanks! — The Lyrist, Jun 12 '18 at 22:39
check out https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html — oW_, Jun 12 '18 at 22:56
You could inner join the two data frames on the columns you care about and check if the number of rows in the result is positive. — dsaxton, Jul 13 '18 at 13:41
FYI, comparing on first and last name on any decently large set of names will end up with pain - lots of people have the same name! — Ken Syme, Jul 13 '18 at 20:31

score 42 · Accepted Answer · edited Nov 01 '21 at 00:15

42

If you want to check equal values on a certain column, let's say Name, you can merge both DataFrames to a new one:

mergedStuff = pd.merge(df1, df2, on=['Name'], how='inner')
mergedStuff.head()

I think this is more efficient and faster than where if you have a big data set.

edited Nov 01 '21 at 00:15

tdy

229
2
9

answered Jun 13 '18 at 11:09

Tarek

724
5
9

2

I think we want to use an inner join here and then check its shape. – dsaxton Jul 13 '18 at 13:43

score 28 · Answer 2 · edited Oct 31 '21 at 16:51

28

You can double check the exact number of common and different positions between two df by using isin and value_counts().

Like that:

df['your_column_name'].isin(df2['your_column_name']).value_counts()

Result:

True = common
False = different

edited Oct 31 '21 at 16:51

tdy

229
2
9

answered Jul 08 '19 at 08:40

QbS

381
3
4

1

This should be the answer in my opinion. – Superdooperhero Jan 16 '20 at 09:12
1

elegant, without creating a new df – arilwan Nov 06 '21 at 15:59

pcko1 · Answer 3 · 2019-07-28T19:37:29.863

22

df1.where(df1.values==df2.values).notna()

True entries show common elements. This also reveals the position of the common elements, unlike the solution with merge.

edited Jul 28 '19 at 19:37

answered Jun 12 '18 at 22:51

pcko1

3,940
2
15
29

what is df. in your answer? There are only df1 and df2 but no df – LearneR Jul 24 '19 at 10:43
when some values are NaN values, it shows False. for other cases OK. need to fillna first. hope there is a shortcut to compare both NaN as True. – ihightower May 03 '20 at 19:53

score 15 · Answer 4 · edited Feb 08 '21 at 01:59

Comparing values in two different columns

Using set, get unique values in each column. The intersection of these two sets will provide the unique values in both the columns.

Example:

df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 5, 1], 'c3': [3, 1, 1]})
 df2 = pd.DataFrame({'c4': [1, 4, 7], 'c2': [3, 5, 2], 'c3': [3, 7, 5]})
 set(df1['c2']).intersection(set(df2['c2']))

Output: {2, 5}

Comparing column names of two dataframes

Incase you are trying to compare the column names of two dataframes:

If df1 and df2 are the two dataframes: set(df1.columns).intersection(set(df2.columns))

This will provide the unique column names which are contained in both the dataframes.

Example:

df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 5, 1], 'c3': [3, 1, 1]})
df2 = pd.DataFrame({'c4': [1, 4, 7], 'c2': [3, 5, 2], 'c3': [3, 7, 5]})
set(df1.columns).intersection(set(df2.columns))

Output: {'c2', 'c3'}

I think the the question is about comparing the values in two different columns in different dataframes as question person wants to check if a person in one data frame is in another one. — Divyanshu Shekhar, Jun 13 '18 at 07:04
Thanks, I got the question wrong. I've updated the answer now. — aathiraks, Jun 13 '18 at 09:24

score 0 · Answer 5 · edited Sep 03 '21 at 21:49

0

You can get the whole common dataframe by using loc and isin.

df_common = df1.loc[df1['set1'].isin(df2['set2'])]

df_common now has only the rows which are the same col value in other dataframe.

edited Sep 03 '21 at 21:49

Ethan

1,633
9
24
39

answered Sep 03 '21 at 15:43

DOT

103
3

score 0 · Answer 6 · answered Jan 26 '22 at 08:52

@Hermes Morales your code will fail for this:

lst =["Juan","Pedro","Carlos"]
lst2=["Cabrera","Paredes", "Olivera"]
lst3 =["Juan","Pedro","Carlos","Joselo"]
lst4=["Cabrera","Olivera","Rubianes"]
df = pd.DataFrame(list(zip(lst, lst2)),
           columns =['Name', 'First_name'])
df1 = pd.DataFrame(list(zip(lst3, lst4)),
           columns =['Name', 'First_name'])

# I want to get a name and first_name which are there in df1 in a single row but not in df.
column1 = "Name"
column2= "First_name" 
first_name = "Olivera"
name ="Pedro"

My suggestion would be to consider both the boths while returning the answer.

def check_for_both_names(df, df1, column1, column2, first_name, name):
    """This function check if two paired elements are present in two different dataframes."""
    # check that the name is in both columns
    col1 = name in list(df[column1]) and name in list(df1[column1])
    # check that first_name is in both columns
    col2 = first_name in list(df[column2]) and first_name in list(df1[column2])
    # check that both name and first_name are in the same row in the first dataframe
    try:
        both = (df[df["First_name"] == first_name].index[0]) == (
            df[df["Name"] == name].index[0])
    except:
        # if name or first_name does not exist
        pass
    # check that both name and first_name are in the same row in the second dataframe
    try:
        both1 = (df1[df1["First_name"] == first_name].index[0]) == (
            df1[df1["Name"] == name].index[0])
    except:
        # if name or first_name does not exist
        pass
return col1 and col2 and both and both1

Please correct me if I'm wrong:)

score 0 · Answer 7 · answered Jul 04 '23 at 21:27

If your use case is where you have multiple data frames and you want to check whether or not the column headers are present in either of them, this piece of code might help.

consider you have df1,df2,df3 as the dataframes, make a variable to store their columns as :

df1_columns = df1.columns
df2_columns = df2.columns
df3_columns = df3.columns
optionally give the columns a name like
df1_columns.name = "df1"

Then use this function

# A code I prompted to generate a data frame to generate a sheet vs column presence table
# This shows if a column header is present across multiple data frames or not
import pandas as pd
def generate_presence_dataframe(*columns):
    """
    Generate a DataFrame to show the presence of attributes in each column.
Args:
    *columns: Variable number of pandas DataFrame columns.

Returns:
    A pandas DataFrame with the presence of attributes in each column.
    The column headers are based on the names of the passed columns,
    or generic names if the columns don't have names.
    The displayed columns are in the same order of the passed column arguments.
    The DataFrame is sorted based on the number of '1' values horizontally (across the rows).

&quot;&quot;&quot;
# Step 1: Convert the column(s) to set(s)
column_sets = [set(col) for col in columns]

# Step 2: Create a set of all unique attributes from the column(s)
all_attributes = sorted(list(set().union(*column_sets)))

# Step 3: Create a dictionary to store the presence of attributes in each column
presence_dict = {'Attributes': all_attributes}
for i, col in enumerate(columns):
    column_name = col.name if col.name else f'Column {i+1}'
    presence_dict[column_name] = [1 if attr in col else 0 for attr in all_attributes]

# Step 4: Create a DataFrame from the presence dictionary
presence_df = pd.DataFrame(presence_dict)

# Step 5: Sort the dataframe based on the number of '1' values horizontally (across the rows)
presence_df = presence_df.iloc[presence_df.iloc[:, 1:].sum(axis=1).sort_values(ascending=False).index]

# Reset the index
presence_df = presence_df.reset_index(drop=True)

return presence_df

You can call it like

columns_presence_df = generate_presence_dataframe(df1, df2, df3)
columns_presence_df

score 0 · Answer 8 · answered Jul 13 '18 at 17:58

0

Note that the columns of dataframes are data series. So if you take two columns as pandas series, you may compare them just like you would do with numpy arrays.

answered Jul 13 '18 at 17:58

Timur

81
2

score 0 · Answer 9 · answered Apr 17 '21 at 20:49

"I'd like to check if a person in one data frame is in another one."

The condition is for both name and first name be present in both dataframes and in the same row.

import pandas as pd
lst =["Juan","Pedro","Carlos"]
lst2=["Cabrera","Olivera","Paredes"]
lst3 =["Juan","Pedro","Carlos","Joselo"]
lst4=["Cabrera","Olivera","Rubianes"]
df = pd.DataFrame(list(zip(lst, lst2)),
           columns =['Name', 'First_name'])
df1 = pd.DataFrame(list(zip(lst3, lst4)),
           columns =['Name', 'First_name'])
column1 = "Name"
column2= "First_name"
def check_if_a_person_in_one_data_frame_is_in_another_one(df,df1,column1,column2,first_name,name):
    """This function check if two paired elements are present in two different dataframes."""
    #check that the name is in both columns
    col1=  name in list(df[column1]) and name in list(df1[column1])
    #check that first_name is in both columns
    col2 = first_name in list(df[column2]) and first_name in list(df1[column2])
    #check that both name and first_name are in the same row in the first dataframe
    try:
          both =(df[df["First_name"]==first_name].index[0]) == (df[df["Name"]==name].index[0])
    except:
          # if name or first_name does not exist
          pass
    #check that both name and first_name are in the same row in the second dataframe
    try:
          both =(df1[df1["First_name"]==first_name].index[0]) == (df1[df1["Name"]==name].index[0])
    except:
          # if name or first_name does not exist
          pass
return col1 and col2 and both


column1 = "Name"
column2= "First_name" 
first_name = "Cabrera"
name ="Juan"
check_if_a_person_in_one_data_frame_is_in_another_one(df,df1,column1,column2,first_name,name)
True

How do I compare columns in different data frames?

9 Answers9

optionally give the columns a name like

Linked