Handling conflicting cases pandas python

Question

I have a data set in which some rows are identical but belong to different classes. Example:

index	Heading 1	Heading 2	Heading 1b	Heading 2b	Class/Target
row -1	a	b	c	d	0
row -2	t	r	f	k	0
row -3	m	u	p	l	0
row -4	a	b	c	d	1
row -5	m	u	p	l	1
row -6	v	r	z	h	0
row -7	z	q	y	o	1
row -8	w	e	t	a	1

row-1 and row-4 are identical rows with different classes, and the same is true of row-3 and row-5. There are only two classes.

I want to assign those rows to a new class, for example 2. The result would look like this:

index	Heading 1	Heading 2	Heading 1b	Heading 2b	Class/Target
row -1	a	b	c	d	2
row -2	t	r	f	k	0
row -3	m	u	p	l	2
row -4	a	b	c	d	2
row -5	m	u	p	l	2
row -6	v	r	z	h	0
row -7	z	q	y	o	1
row -8	w	e	t	a	1

As shown, those rows are mapped to 2, and the duplicates are kept in their original order. Previously, I used iloc and iterated over the rows, but that takes a huge amount of time because the data set is large. I then converted the data into a dictionary, which was fast but requires some manipulation and more code. I would like to know how this can be done in a simple way.

Because each row corresponds to a line. After the process, I will add a column based on each row. – Abc1729, Jul 28 '21 at 09:44

Adept · Accepted Answer · 2021-07-28T11:47:16.590

0

Your headings are pretty strange, since you have columns with the same names. I'll use columns Heading 1, Heading 2, Heading 1b, Heading 2b.

First, create a new column in your df that marks duplicate rows:

df['Duped'] = df.duplicated(subset=['Heading 1', 'Heading 2', 'Heading 1b', 'Heading 2b'], keep=False).astype(int)

You now have a Duped column that is 1 if the row is duplicated and 0 otherwise.
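For readers unsure why keep=False matters here, the following is a minimal sketch on a hypothetical three-row frame (the names toy, x and y are made up for illustration). With the default keep='first', only the later repeats are flagged, so the first occurrence would be left unmarked.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 have the same values.
toy = pd.DataFrame({"x": ["a", "t", "a"], "y": ["b", "r", "b"]})

# Default behaviour: the first occurrence is treated as the original.
later_copies_only = toy.duplicated(keep="first")

# keep=False: every member of a repeated group is flagged.
all_members = toy.duplicated(keep=False)

print(later_copies_only.tolist())  # [False, False, True]
print(all_members.tolist())        # [True, False, True]
```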

Then modify Class/Target according to it:

df.loc[df['Duped'] == 1, 'Class/Target'] = 2

Then drop the intermediate Duped column:

df = df.drop(columns=['Duped'])
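Note that duplicated flags every repeated row, including repeats that already agree on the class. If only groups whose classes actually conflict should be relabeled, one possible variant (a sketch, not part of the answer above) counts the distinct classes per group of identical feature rows. The frame below rebuilds the question's sample data; the names features, conflict and out are assumptions made for this example.

```python
import pandas as pd

features = ["Heading 1", "Heading 2", "Heading 1b", "Heading 2b"]

# Sample data from the question (row-1 .. row-8 become index 0 .. 7).
df = pd.DataFrame({
    "Heading 1": ["a", "t", "m", "a", "m", "v", "z", "w"],
    "Heading 2": ["b", "r", "u", "b", "u", "r", "q", "e"],
    "Heading 1b": ["c", "f", "p", "c", "p", "z", "y", "t"],
    "Heading 2b": ["d", "k", "l", "d", "l", "h", "o", "a"],
    "Class/Target": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Number of distinct classes seen for each group of identical feature rows,
# broadcast back to the original row order.
n_classes = df.groupby(features)["Class/Target"].transform("nunique")
conflict = n_classes > 1

# Relabel only the conflicting rows; row order is left unchanged.
out = df.copy()
out.loc[conflict, "Class/Target"] = 2

print(out["Class/Target"].tolist())  # [2, 0, 2, 2, 2, 0, 1, 1]
```

On this sample the result matches the duplicated-based steps, because every repeated row here also has conflicting classes. The two approaches differ only when identical rows share the same class.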

edited Jul 28 '21 at 11:47

answered Jul 28 '21 at 10:11


Thanks. Yes, the columns actually have different names: heading-{1,2,3,4} – Abc1729 Jul 28 '21 at 11:43

score 0 · Answer 2 · answered Jul 28 '21 at 11:01

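Try the following code. Note that it treats rows as duplicates only when every column matches, including the target; it relabels the first row of each such group and drops the other copies:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a', 'b', 'a'], 'col2': [1, 2, 3, 1, 2, 1], 'target': ['d', 'e', 'f', 'd', 'e', 'd']})
# Rows that are identical across all columns (target included)
df2 = df[df.duplicated(keep=False)]
# One tuple of row labels per group of identical rows
index_list = df2.groupby(list(df2)).apply(lambda x: tuple(x.index)).tolist()
# Relabel the first row of each group
index_new_category = [i[0] for i in index_list]
df.loc[index_new_category, 'target'] = 2
# Drop the remaining copies in each group
index_to_drop = [label for i in index_list for label in i[1:]]
df = df.drop(index_to_drop)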
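The same effect can probably be reached without groupby by combining two duplicated masks. This is only a sketch of an equivalent approach, under the assumption that the goal is to keep the first copy of each fully identical group, relabel it, and drop the rest; the names in_repeated_group and is_later_copy are made up here. The target column is built with object dtype so that a numeric label can sit next to the string labels.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c", "a", "b", "a"],
    "col2": [1, 2, 3, 1, 2, 1],
    # object dtype: the column will hold both strings and the integer 2.
    "target": pd.Series(["d", "e", "f", "d", "e", "d"], dtype=object),
})

# Compute both masks before changing anything, since they compare all columns.
in_repeated_group = df.duplicated(keep=False)   # any row that has a twin
is_later_copy = df.duplicated(keep="first")     # every twin except the first

# Relabel the first row of each repeated group, then drop the later copies.
df.loc[in_repeated_group & ~is_later_copy, "target"] = 2
df = df[~is_later_copy]

print(df["target"].tolist())  # [2, 2, 'f']
```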
