0

This work started by comparing two columns in each data set in pandas.
Previous research:here
A lot of results online show how to compare 2 data frames with 1 column
I'm trying to learn how to compare and extract similarities between two data frames (same & different sizes if possible) using more than 1 column in pandas.

sample input:

df1=pd.DataFrame([[1,None],[1,None,],[1,None],[1,'item_a'],[2,'item_a'],[2,'item_b'],[2,'item_f'],[3,'item_e'],[3,'item_e'],[3,'item_g'],[3,'item_h']],columns=['id','A'])
df2=pd.DataFrame([[1,'item_a'],[1,'item_b'],[1,'item_c'],[1,'item_d'],[2,'item_a'],[2,'item_b'],[2,'item_c'],[2,'item_d'],[3,'item_e'],[3,'item_f'],[3,'item_g'],[3,'item_h']],columns=['id','A'])



 df1
        id  A
    0   1   None
    1   1   None
    2   1   None
    3   1   item_a
    4   2   item_a
    5   2   item_b
    6   2   item_f
    7   3   item_e
    8   3   item_e
    9   3   item_g
    10  3   item_h



df2
    id  A
0   1   item_a
1   1   item_b
2   1   item_c
3   1   item_d
4   2   item_a
5   2   item_b
6   2   item_c
7   2   item_d
8   3   item_e
9   3   item_f
10  3   item_g
11  3   item_h

What I've tried so far:

   1: df1[df1.A.isin(df2.A) & df1.id.isin(df2.id)]

2: df1[ df1[['id', 'A']].isin(df2[['id', 'A']]) ]

The output I got for 1 is close to what I desire:

    id  A
3   1   item_a
4   2   item_a
5   2   item_b
6   2   item_f #this specific row is not desired in the output
7   3   item_e
8   3   item_e #this specific row was raised due to a duplicate in `df1`. It's permitted to show duplicates. Duplicates values are allowed in `df1` but not `df2`.
9   3   item_g
10  3   item_h

Desired output:

    id  A
3   1   item_a
4   2   item_a
5   2   item_b
7   3   item_e
8   3   item_e
9   3   item_g
10  3   item_h

What's not shown: Two data frames have 2500+ rows. df1 can have the same items associated with an id. No duplicate items for an id in df2.
My 2nd try 2: df1[ df1[['id', 'A']].isin(df2[['id', 'A']]) ] is definitely the wrong approach as its matching row and column in df1 to row and column df2 (This output is similar to equals(), I get values from df1 instead of True and NaN instead of False)
Any code, links, suggestions are appreciated.

1 Answers1

0

Required DataFrame

Data Frame side by side

The last df is your required result, 2nd last is mine. I think 3, item_f is not possible as its absent in df1.