PySpark: add a new column with the data frame row number

Question

Hi, I'm trying to build a recommendation system with Spark.

I have a data frame with user emails and movie ratings.

df = pd.DataFrame(np.array([["[email protected]",2,3],["[email protected]",5,5],["[email protected]",8,2],["[email protected]",9,3]]), columns=['user','movie','rating'])

sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

           user movie rating
  [email protected]     2      3
  [email protected]     5      5
  [email protected]     8      2
  [email protected]     9      3

My first question: PySpark MLlib doesn't accept emails as user identifiers, am I correct? Because of this, I need to replace each email with a primary key.

My approach was to create a temporary table and select the distinct users. Now I want to add a new column with a row number, which will be the primary key for each user.

sparkdf.registerTempTable("sparkdf")

DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")

What I have

+------------+
|        user|
+------------+
|[email protected]|
|[email protected]|
|[email protected]|
+------------+

What I want

+------------+---+
|        user| PK|
+------------+---+
|[email protected]|  1|
|[email protected]|  2|
|[email protected]|  3|
+------------+---+

Next I will do a join to obtain the final data frame to use in MLlib:

user movie rating
  1     2      3
  1     5      5
  2     8      2
  3     9      3

Regards and thanks for your time.

score 2 · Accepted Answer · edited May 23 '17 at 12:23


The question "Primary keys with Apache Spark" practically answers yours, but in this particular case StringIndexer could be a better choice:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="user", outputCol="user_id")
indexed = indexer.fit(sparkdf).transform(sparkdf)
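
To make the resulting ids less of a black box, here is a minimal plain-Python sketch of the kind of mapping StringIndexer builds by default: each distinct label gets a 0-based index, with the most frequent label first, stored as a float. It is not Spark code. The helper `build_index`, the sample emails, and the alphabetical tie-break are assumptions for illustration; check the tie-break against your Spark version.

```python
from collections import Counter


def build_index(labels):
    """Map each distinct label to a 0-based float id, most frequent first.

    Ties are broken alphabetically here; that is an assumption of this
    sketch, not something the answer above states.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical rows in the shape (user, movie, rating).
rows = [
    ("alice@example.com", 2, 3),
    ("alice@example.com", 5, 5),
    ("bob@example.com", 8, 2),
    ("carol@example.com", 9, 3),
]

user_ids = build_index(user for user, _, _ in rows)

# Replace each email with its integer id, as the final data frame would.
indexed_rows = [(int(user_ids[user]), movie, rating) for user, movie, rating in rows]
print(indexed_rows)
```

Two things follow from this. The ids start at 0 rather than 1, and the `user_id` column is a double, so you will probably want to cast it to an integer before passing it to a recommender.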


answered Feb 03 '16 at 11:06

zero323

