Let's assume that you have a DataFrame of users. In Spark, you could create a sample of such a DataFrame like this:
import spark.implicits._
import org.apache.spark.sql.functions.countDistinct

val df = Seq(
  ("me", "[email protected]"),
  ("me", "[email protected]"),
  ("you", "[email protected]")
).toDF("user_id", "email")
df.show()
+-------+---------------+
|user_id| email|
+-------+---------------+
| me| [email protected]|
| me| [email protected]|
| you|[email protected]|
+-------+---------------+
Now, the logic is very similar to the one you would have in SQL:
df.groupBy("user_id")
.agg(countDistinct("email") as "count")
.where('count > 1)
.show()
+-------+-----+
|user_id|count|
+-------+-----+
| me| 2|
+-------+-----+
Then you can add a .drop("count")
or a .select("user_id")
to only keep users.
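For instance, here is a minimal sketch reusing the df defined above; the variable name duplicatedUsers is just illustrative:

// Keep only the user IDs that have more than one distinct email.
val duplicatedUsers = df.groupBy("user_id")
  .agg(countDistinct("email") as "count")
  .where($"count" > 1)
  .select("user_id")

duplicatedUsers.show() // on the sample data above, this shows only "me"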
Note that there is no having clause in the DataFrame API. Once you have called agg to aggregate your DataFrame by user, you have a regular DataFrame on which you can call any transformation function, such as a filter on the count column here.
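To illustrate that last point, here is a sketch (same sample df, illustrative variable name counts) where further transformations are simply chained after agg; the where call plays the role of SQL's HAVING:

val counts = df.groupBy("user_id")
  .agg(countDistinct("email") as "count")

counts
  .where($"count" > 1)     // equivalent of HAVING count(DISTINCT email) > 1
  .orderBy($"count".desc)  // any other transformation can be chained as well
  .show()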