I have created a Spark dataset from a CSV file.
The schema is:
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- Email: string (nullable = true)
|-- Phone: string (nullable = true)
I am performing deduplication on the email field:
Dataset<Row> customer = spark.read()
    .option("header", "true")
    .option("charset", "UTF8")
    .option("delimiter", ",")
    .csv(path);
// Selects only the Email column, so the other columns are lost
Dataset<Row> distinct = customer.select("Email").distinct();
I would like to create an output CSV file containing the full rows, one per distinct email ID.
How can I query the dataset to retrieve the records with distinct emails?
Sample Input:
John David [email protected] 2222
John Smith [email protected] 4444
John D [email protected] 2222
Sample Output:
John David [email protected] 2222
John Smith [email protected] 4444
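To make the expected result precise, here is a minimal plain-Java sketch (no Spark involved) of the semantics I am after: keep the first full row seen for each Email value. The class name `DistinctEmailSketch`, the method `dedupeByEmail`, and the `example.com` addresses are hypothetical placeholders, not my real data. I believe Spark's `Dataset.dropDuplicates("Email")` is the closest built-in, though as far as I know it does not guarantee which of the duplicate rows is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the desired result (not Spark code):
// one row per distinct Email, keeping the first row encountered.
class DistinctEmailSketch {

    // Each row is {FirstName, LastName, Email, Phone}; Email is at index 2.
    static List<String[]> dedupeByEmail(List<String[]> rows) {
        // LinkedHashMap keeps insertion order, so output order follows the input.
        Map<String, String[]> firstByEmail = new LinkedHashMap<>();
        for (String[] row : rows) {
            // Only stores the row if this Email has not been seen yet.
            firstByEmail.putIfAbsent(row[2], row);
        }
        return new ArrayList<>(firstByEmail.values());
    }

    public static void main(String[] args) {
        // Hypothetical addresses; rows 1 and 3 are assumed to share an email.
        List<String[]> rows = Arrays.asList(
            new String[] {"John", "David", "a@example.com", "2222"},
            new String[] {"John", "Smith", "b@example.com", "4444"},
            new String[] {"John", "D", "a@example.com", "2222"});

        for (String[] row : dedupeByEmail(rows)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

With the three hypothetical rows above, only the "John David" and "John Smith" rows remain, which matches the sample output I want from Spark.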
Thanks in advance