I have a dataset with a lot of email addresses and I want to change this:

import pandas as pd

df = pd.DataFrame( [('[email protected]', 0, 3.0), ('[email protected]', 1, 2.0), 
                    ('[email protected]', 1 ,3.0), ('[email protected]', 1, 1.0), 
                    ('[email protected]', 2, 5.0)]) 

df
0  [email protected]  0  3
1  [email protected]  1  2
2  [email protected]  1  3
3  [email protected]  1  1
4  [email protected]  2  5

to this:

df2 = pd.DataFrame(
[(0, 0, 3.0), (0, 1, 2.0), (0,1 ,3.0), (1, 1, 1.0), (2, 2, 5.0)])

df2
   0  1  2
0  0  0  3
1  0  1  2
2  0  1  3
3  1  1  1
4  2  2  5

i.e., replace each email with a number, so that the same email always maps to the same number.

How can I do this?

jezrael
Kardu

1 Answer

Use factorize:

df[0] = pd.factorize(df[0])[0]

print(df)

   0  1  2
0  0  0  3
1  0  1  2
2  0  1  3
3  1  1  1
4  2  2  5
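As a self-contained sketch of the factorize approach: the addresses in the question are obfuscated, so the ones below are hypothetical placeholders, and the variable names `codes` and `uniques` are chosen here for illustration. `pd.factorize` returns both the integer codes and the unique values, so the second element can be kept if the mapping back to the emails is needed.

```python
import pandas as pd

# Hypothetical addresses standing in for the obfuscated ones in the question.
df = pd.DataFrame([
    ("a@example.com", 0, 3.0),
    ("a@example.com", 1, 2.0),
    ("a@example.com", 1, 3.0),
    ("b@example.com", 1, 1.0),
    ("c@example.com", 2, 5.0),
])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(df[0])
df[0] = codes

print(df[0].tolist())   # [0, 0, 0, 1, 2]
print(list(uniques))    # ['a@example.com', 'b@example.com', 'c@example.com']

# uniques[code] recovers the original email for a given code.
print(uniques[1])       # b@example.com
```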

Or rank, which numbers the emails in sorted order and returns floats, so cast back to int:

df[0] = (df[0].rank(method='dense') - 1).astype(int)
print(df)

   0  1  2
0  0  0  3
1  0  1  2
2  0  1  3
3  1  1  1
4  2  2  5
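The two approaches give the same result on the sample data only because the emails there already appear in sorted order. A small sketch with hypothetical, unsorted addresses shows the difference: factorize numbers values by first appearance, while a dense rank numbers them alphabetically. Passing `sort=True` to `pd.factorize` should give the same alphabetical codes without going through floats.

```python
import pandas as pd

# Hypothetical, deliberately unsorted addresses.
s = pd.Series(["c@example.com", "a@example.com", "c@example.com", "b@example.com"])

# Codes in order of first appearance.
by_appearance = pd.factorize(s)[0]

# Codes in alphabetical order via dense rank (rank returns floats, hence the cast).
by_rank = (s.rank(method="dense") - 1).astype(int)

# Codes in alphabetical order via factorize with sort=True.
by_sorted_factorize = pd.factorize(s, sort=True)[0]

print(by_appearance.tolist())        # [0, 1, 0, 2]
print(by_rank.tolist())              # [2, 0, 2, 1]
print(by_sorted_factorize.tolist())  # [2, 0, 2, 1]
```

Either numbering satisfies the requirement that the same email always gets the same number; the choice only matters if the numbers themselves need to follow a particular order.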
jezrael