Hi, I'm trying to build a recommendation system with Spark.
I have a data frame with user emails, movie IDs, and ratings.
import pandas as pd
# Plain list of rows: wrapping them in np.array would coerce movie and rating to strings.
df = pd.DataFrame([["[email protected]", 2, 3], ["[email protected]", 5, 5], ["[email protected]", 8, 2], ["[email protected]", 9, 3]], columns=['user', 'movie', 'rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
user movie rating
[email protected] 2 3
[email protected] 5 5
[email protected] 8 2
[email protected] 9 3
My first question: PySpark MLlib doesn't accept emails as user identifiers, am I correct? Because of this, I need to replace each email with a numeric primary key.
My approach was to create a temporary table and select the distinct users. Now I want to add a new column with a row number, and this number will be the primary key for each user.
sparkdf.registerTempTable("sparkdf")
DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")
What I have
+------------+
| user|
+------------+
|[email protected]|
|[email protected]|
|[email protected]|
+------------+
What I want
+------------+---+
|        user| PK|
+------------+---+
|[email protected]|  1|
|[email protected]|  2|
|[email protected]|  3|
+------------+---+
Next, I will join this key table back to the ratings to obtain the final data frame to use in MLlib:
user movie rating
1 2 3
1 5 5
2 8 2
3 9 3
Regards and thanks for your time.