I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.
-
Could you please indicate how you want the result to look like? Is it a df with names appearing in both dfs, and whether you also need anything else such as count, or matching column in df2 ,etc. Thanks! – The Lyrist Jun 12 '18 at 22:39
-
check out https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html – oW_ Jun 12 '18 at 22:56
-
You could inner join the two data frames on the columns you care about and check if the number of rows in the result is positive. – dsaxton Jul 13 '18 at 13:41
-
FYI, comparing on first and last name on any decently large set of names will end up with pain - lots of people have the same name! – Ken Syme Jul 13 '18 at 20:31
9 Answers
If you want to check equal values on a certain column, let's say Name
, you can merge
both DataFrames to a new one:
mergedStuff = pd.merge(df1, df2, on=['Name'], how='inner')
mergedStuff.head()
I think this is more efficient and faster than where
if you have a big data set.
You can double check the exact number of common and different positions between two df by using isin
and value_counts()
.
Like that:
df['your_column_name'].isin(df2['your_column_name']).value_counts()
Result:
True
= common
False
= different
df1.where(df1.values==df2.values).notna()
True
entries show common elements. This also reveals the position of the common elements, unlike the solution with merge
.

- 3,940
- 2
- 15
- 29
-
-
when some values are NaN values, it shows False. for other cases OK. need to fillna first. hope there is a shortcut to compare both NaN as True. – ihightower May 03 '20 at 19:53
Comparing values in two different columns
Using set, get unique values in each column. The intersection of these two sets will provide the unique values in both the columns.
Example:
df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 5, 1], 'c3': [3, 1, 1]})
df2 = pd.DataFrame({'c4': [1, 4, 7], 'c2': [3, 5, 2], 'c3': [3, 7, 5]})
set(df1['c2']).intersection(set(df2['c2']))
Output:
{2, 5}
Comparing column names of two dataframes
Incase you are trying to compare the column names of two dataframes:
If df1
and df2
are the two dataframes:
set(df1.columns).intersection(set(df2.columns))
This will provide the unique column names which are contained in both the dataframes.
Example:
df1 = pd.DataFrame({'c1': [1, 4, 7], 'c2': [2, 5, 1], 'c3': [3, 1, 1]})
df2 = pd.DataFrame({'c4': [1, 4, 7], 'c2': [3, 5, 2], 'c3': [3, 7, 5]})
set(df1.columns).intersection(set(df2.columns))
Output:
{'c2', 'c3'}
-
2I think the the question is about comparing the values in two different columns in different dataframes as question person wants to check if a person in one data frame is in another one. – Divyanshu Shekhar Jun 13 '18 at 07:04
-
1
@Hermes Morales your code will fail for this:
lst =["Juan","Pedro","Carlos"]
lst2=["Cabrera","Paredes", "Olivera"]
lst3 =["Juan","Pedro","Carlos","Joselo"]
lst4=["Cabrera","Olivera","Rubianes"]
df = pd.DataFrame(list(zip(lst, lst2)),
columns =['Name', 'First_name'])
df1 = pd.DataFrame(list(zip(lst3, lst4)),
columns =['Name', 'First_name'])
# I want to get a name and first_name which are there in df1 in a single row but not in df.
column1 = "Name"
column2= "First_name"
first_name = "Olivera"
name ="Pedro"
My suggestion would be to consider both the boths
while returning the answer.
def check_for_both_names(df, df1, column1, column2, first_name, name):
"""This function check if two paired elements are present in two different dataframes."""
# check that the name is in both columns
col1 = name in list(df[column1]) and name in list(df1[column1])
# check that first_name is in both columns
col2 = first_name in list(df[column2]) and first_name in list(df1[column2])
# check that both name and first_name are in the same row in the first dataframe
try:
both = (df[df["First_name"] == first_name].index[0]) == (
df[df["Name"] == name].index[0])
except:
# if name or first_name does not exist
pass
# check that both name and first_name are in the same row in the second dataframe
try:
both1 = (df1[df1["First_name"] == first_name].index[0]) == (
df1[df1["Name"] == name].index[0])
except:
# if name or first_name does not exist
pass
return col1 and col2 and both and both1
Please correct me if I'm wrong:)

- 3
- 3
If your use case is where you have multiple data frames and you want to check whether or not the column headers are present in either of them, this piece of code might help.
consider you have df1
,df2
,df3
as the dataframes, make a variable to store their columns as :
df1_columns = df1.columns
df2_columns = df2.columns
df3_columns = df3.columns
optionally give the columns a name like
df1_columns.name = "df1"
Then use this function
# A code I prompted to generate a data frame to generate a sheet vs column presence table
# This shows if a column header is present across multiple data frames or not
import pandas as pd
def generate_presence_dataframe(*columns):
"""
Generate a DataFrame to show the presence of attributes in each column.
Args:
*columns: Variable number of pandas DataFrame columns.
Returns:
A pandas DataFrame with the presence of attributes in each column.
The column headers are based on the names of the passed columns,
or generic names if the columns don't have names.
The displayed columns are in the same order of the passed column arguments.
The DataFrame is sorted based on the number of '1' values horizontally (across the rows).
"""
# Step 1: Convert the column(s) to set(s)
column_sets = [set(col) for col in columns]
# Step 2: Create a set of all unique attributes from the column(s)
all_attributes = sorted(list(set().union(*column_sets)))
# Step 3: Create a dictionary to store the presence of attributes in each column
presence_dict = {'Attributes': all_attributes}
for i, col in enumerate(columns):
column_name = col.name if col.name else f'Column {i+1}'
presence_dict[column_name] = [1 if attr in col else 0 for attr in all_attributes]
# Step 4: Create a DataFrame from the presence dictionary
presence_df = pd.DataFrame(presence_dict)
# Step 5: Sort the dataframe based on the number of '1' values horizontally (across the rows)
presence_df = presence_df.iloc[presence_df.iloc[:, 1:].sum(axis=1).sort_values(ascending=False).index]
# Reset the index
presence_df = presence_df.reset_index(drop=True)
return presence_df
You can call it like
columns_presence_df = generate_presence_dataframe(df1, df2, df3)
columns_presence_df

- 1
Note that the columns of dataframes are data series. So if you take two columns as pandas series, you may compare them just like you would do with numpy arrays.

- 81
- 2
"I'd like to check if a person in one data frame is in another one."
The condition is for both name and first name be present in both dataframes and in the same row.
import pandas as pd
lst =["Juan","Pedro","Carlos"]
lst2=["Cabrera","Olivera","Paredes"]
lst3 =["Juan","Pedro","Carlos","Joselo"]
lst4=["Cabrera","Olivera","Rubianes"]
df = pd.DataFrame(list(zip(lst, lst2)),
columns =['Name', 'First_name'])
df1 = pd.DataFrame(list(zip(lst3, lst4)),
columns =['Name', 'First_name'])
column1 = "Name"
column2= "First_name"
def check_if_a_person_in_one_data_frame_is_in_another_one(df,df1,column1,column2,first_name,name):
"""This function check if two paired elements are present in two different dataframes."""
#check that the name is in both columns
col1= name in list(df[column1]) and name in list(df1[column1])
#check that first_name is in both columns
col2 = first_name in list(df[column2]) and first_name in list(df1[column2])
#check that both name and first_name are in the same row in the first dataframe
try:
both =(df[df["First_name"]==first_name].index[0]) == (df[df["Name"]==name].index[0])
except:
# if name or first_name does not exist
pass
#check that both name and first_name are in the same row in the second dataframe
try:
both =(df1[df1["First_name"]==first_name].index[0]) == (df1[df1["Name"]==name].index[0])
except:
# if name or first_name does not exist
pass
return col1 and col2 and both
column1 = "Name"
column2= "First_name"
first_name = "Cabrera"
name ="Juan"
check_if_a_person_in_one_data_frame_is_in_another_one(df,df1,column1,column2,first_name,name)
True

- 181
- 8