Let's assume that you have a DataFrame of users. In Spark, you could create a sample of such a DataFrame like this:
import spark.implicits._
import org.apache.spark.sql.functions.countDistinct

val df = Seq(
  ("me", "[email protected]"),
  ("me", "[email protected]"),
  ("you", "[email protected]")
).toDF("user_id", "email")
df.show()
+-------+---------------+
|user_id| email|
+-------+---------------+
| me| [email protected]|
| me| [email protected]|
| you|[email protected]|
+-------+---------------+
Now, the logic is very similar to the one you would have in SQL:
df.groupBy("user_id")
.agg(countDistinct("email") as "count")
.where('count > 1)
.show()
+-------+-----+
|user_id|count|
+-------+-----+
| me| 2|
+-------+-----+
Then you can add a .drop("count")
or a .select("user_id")
to only keep users.
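For instance, here is a minimal sketch reusing the df defined above; the variable name duplicatedUsers is just illustrative:

// Keep only the user IDs that have more than one distinct email.
val duplicatedUsers = df.groupBy("user_id")
  .agg(countDistinct("email") as "count")
  .where($"count" > 1)
  .select("user_id")

duplicatedUsers.show() // on the sample data above, this shows only "me"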
Note that there is no having clause in the DataFrame API. Once you have called agg to aggregate your DataFrame by user, you have a regular DataFrame on which you can call any transformation function, such as a filter on the count column here.
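To illustrate that last point, here is a sketch (same sample df, illustrative variable name counts) where further transformations are simply chained after agg; the where call plays the role of SQL's HAVING:

val counts = df.groupBy("user_id")
  .agg(countDistinct("email") as "count")

counts
  .where($"count" > 1)     // equivalent of HAVING count(DISTINCT email) > 1
  .orderBy($"count".desc)  // any other transformation can be chained as well
  .show()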