
I have a data set in which some rows are identical but belong to different classes. For example:

index   Heading 1   Heading 2   Heading 1b   Heading 2b   Class/Target
row-1   a           b           c            d            0
row-2   t           r           f            k            0
row-3   m           u           p            l            0
row-4   a           b           c            d            1
row-5   m           u           p            l            1
row-6   v           r           z            h            0
row-7   z           q           y            o            1
row-8   w           e           t            a            1

row-1 and row-4 are the same row but with different classes. The same is true of row-3 and row-5. There are only two classes.

I want to assign those rows to a new class, say 2. The result would look like this:

index   Heading 1   Heading 2   Heading 1b   Heading 2b   Class/Target
row-1   a           b           c            d            2
row-2   t           r           f            k            0
row-3   m           u           p            l            2
row-4   a           b           c            d            1
row-5   m           u           p            l            2
row-6   v           r           z            h            0
row-7   z           q           y            o            1
row-8   w           e           t            a            1

We can see that those rows are mapped to 2, and the duplicates are kept in the same order. Previously, I used iloc and iterated over the rows, but that takes a huge amount of time because the data set is huge. So I converted it into a dictionary, which was fine and fast, but it requires a bit of manipulation and more coding work. I would like to know how this can be done in a simple way.

Abc1729

2 Answers


Your headings are pretty strange, since you have columns with the same names. I'll use columns Heading 1, Heading 2, Heading 1b, Heading 2b.

First, create a new column in your df marking duplicate rows:

df['Duped'] = df.duplicated(subset=['Heading 1', 'Heading 2', 'Heading 1b', 'Heading 2b'], keep=False).astype(int)

You now have a Duped column, with 1 if the row is duplicated and 0 if not.

Then modify Class/Target accordingly:

df.loc[df['Duped'] == 1, 'Class/Target'] = 2

Then drop the intermediate Duped column:

df = df.drop(columns=['Duped'])
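
The three steps above can be sketched end to end on the sample data from the question. This is a minimal sketch, not a definitive implementation: the DataFrame construction, the `feature_cols` name, and the `row-N` index labels are assumptions made for illustration, and the boolean mask is used directly instead of the intermediate Duped column. Note that this marks every member of a duplicated group, so row-4 also becomes 2.

```python
import pandas as pd

# Sample data rebuilt from the question (the construction itself is an assumption).
df = pd.DataFrame(
    {
        "Heading 1": ["a", "t", "m", "a", "m", "v", "z", "w"],
        "Heading 2": ["b", "r", "u", "b", "u", "r", "q", "e"],
        "Heading 1b": ["c", "f", "p", "c", "p", "z", "y", "t"],
        "Heading 2b": ["d", "k", "l", "d", "l", "h", "o", "a"],
        "Class/Target": [0, 0, 0, 1, 1, 0, 1, 1],
    },
    index=[f"row-{i}" for i in range(1, 9)],
)

# Hypothetical helper name: the columns that define "the same row".
feature_cols = ["Heading 1", "Heading 2", "Heading 1b", "Heading 2b"]

# keep=False flags every member of a duplicated group, not just the repeats.
mask = df.duplicated(subset=feature_cols, keep=False)

# Overwrite the class of the flagged rows; row order and row count are unchanged.
df.loc[mask, "Class/Target"] = 2

print(df)
```

Because the operation is vectorized, it avoids the row-by-row iloc loop described in the question.
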
Adept

Try the following code:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a', 'b', 'a'], 'col2': [1, 2, 3, 1, 2, 1], 'target': ['d', 'e', 'f', 'd', 'e', 'd']})

# Rows that are duplicated across all columns
df2 = df[df.duplicated(keep=False)]
# One tuple of index labels per group of identical rows
index_list = df2.groupby(list(df2)).apply(lambda x: tuple(x.index)).tolist()

# Assign the new class to the first row of each group
index_new_category = [i[0] for i in index_list]
df.loc[index_new_category, 'target'] = 2

# Drop the remaining rows of each group
index_to_drop = [i[1:] for i in index_list]
index_to_drop = list(sum(index_to_drop, ()))

df = df.drop(index_to_drop)
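
The code above compares whole rows, target included, and drops the repeats, whereas the question asks about rows whose features match but whose classes differ, with all rows kept. One possible variant, sketched here under assumptions, groups on the feature columns only and relabels the groups that contain more than one distinct class. The small frame, the `feature_cols` name, and the `n_classes` name are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 share features but disagree on the class;
# rows 2 and 3 are plain duplicates with the same class.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "c"],
        "col2": [1, 1, 2, 2, 3],
        "target": [0, 1, 0, 0, 1],
    }
)

feature_cols = ["col1", "col2"]

# For each row, count the distinct classes seen among rows with the same features.
n_classes = df.groupby(feature_cols)["target"].transform("nunique")

# Relabel only the conflicting groups; nothing is dropped or reordered.
df.loc[n_classes > 1, "target"] = 2

print(df)
```

Duplicates that agree on the class (rows 2 and 3 here) keep their original label under this variant.
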

Shrinidhi M