I have multiple email addresses in a single field of a dataframe, and I have to validate that every address contains @ and .com and that the addresses are separated by a ; delimiter.

a
-----------------------------------------------
[email protected];[email protected]
sample
[email protected]
[email protected];test2@email.,[email protected]

Expected output :

a                                                 a_new
---------------------------------------------------------
[email protected];[email protected]                Valid
sample                                            Invalid
[email protected]                                  Valid
[email protected];test2@email.,[email protected]   Invalid

The second and fourth records are invalid. The second has neither @ nor .com, even though it is a single value. In the fourth, which holds multiple addresses, test2@email., is missing com and is followed by a different delimiter (,).

I was able to validate a single email address, but I am not sure how to test a field that contains multiple email addresses.

blackbishop
Santosh
  • Show us the code you're using right now? One option for validating multiple addresses would be to split on `;` and then validate each of the resulting items. – larsks Jan 27 '21 at 22:30
  • @Santosh, you posted a brilliant question `pyspark/hive count using window function` but deleted. I have a solution. Let me know if you still need help and will post the answer – wwnde Feb 01 '22 at 12:59
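
The split-and-validate idea from the first comment can be sketched in plain Python. This is only an illustration under assumptions: the name `validate_field`, the pattern `SINGLE_EMAIL`, and the example addresses in the usage note are hypothetical and not part of the question.

```python
import re

# Hypothetical single-address pattern: something@something.com, with no
# '@', ';', ',' or whitespace allowed inside either part of the address.
SINGLE_EMAIL = re.compile(r"[^@;,\s]+@[^@;,\s]+\.com")


def validate_field(value: str) -> str:
    """Return "Valid" if every ';'-separated item looks like x@y.com."""
    items = value.split(";")
    # fullmatch requires the whole item to fit the pattern, so a stray
    # ',' or a missing 'com' makes that item (and the field) invalid.
    ok = all(SINGLE_EMAIL.fullmatch(item) for item in items)
    return "Valid" if ok else "Invalid"
```

For example, `validate_field("alice@example.com;bob@example.com")` gives "Valid", while `validate_field("sample")` gives "Invalid". On a Spark dataframe the same per-item check would have to be expressed with Spark functions or a UDF, which this sketch does not cover.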

1 Answer

For a more complex email-validation regex, you can see this post.

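But if you only want to verify that an email has the form [email protected], you can use a simple regex such as [^@;,\s]+@[^@;,\s]+\.com. The character class matters here: a bare .+ would also match the ; and , delimiters, so a whole list could pass as one address. To check that the field is a list of such emails separated by ; (with no space after the delimiter, as in your data), use ^[^@;,\s]+@[^@;,\s]+\.com(;[^@;,\s]+@[^@;,\s]+\.com)*$ with the function rlike:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("[email protected];[email protected]",),
    ("sample",),
    ("[email protected]",),
    ("[email protected];test2@email.,[email protected] ",)
]
df = spark.createDataFrame(data, ["a"])

# One address: no '@', ';', ',' or whitespace inside either part, ending in .com
email = "[^@;,\\s]+@[^@;,\\s]+\\.com"
# One address followed by zero or more ';'-separated addresses
pattern = f"^{email}(;{email})*$"

df1 = df.withColumn("a_new",
                    F.when(
                        F.col("a").rlike(pattern),
                        "Valid"
                    ).otherwise("Invalid")
                  )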

df1.show(truncate=False)

#+------------------------------------------------+-------+
#|a                                               |a_new  |
#+------------------------------------------------+-------+
#|[email protected];[email protected]              |Valid  |
#|sample                                          |Invalid|
#|[email protected]                                |Valid  |
#|[email protected];test2@email.,[email protected] |Invalid|
#+------------------------------------------------+-------+
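
A list pattern of this kind can be sanity-checked locally with Python's re module before handing it to rlike. The sketch below is a minimal check under assumptions: the names `EMAIL`, `EMAIL_LIST`, and `is_valid_list` and the example addresses are hypothetical, and the pattern excludes @, ;, , and whitespace from each part of an address and uses a bare ; as the separator. Spark evaluates rlike with Java regular expressions, so it is worth re-running the same cases in Spark afterwards.

```python
import re

# Hypothetical pattern for one address of the form x@y.com; the character
# class keeps '@', ';', ',' and whitespace out of both parts.
EMAIL = r"[^@;,\s]+@[^@;,\s]+\.com"

# One address, then zero or more further addresses, each preceded by ';'.
EMAIL_LIST = re.compile(rf"{EMAIL}(;{EMAIL})*")


def is_valid_list(value: str) -> bool:
    """True if the whole string is a ';'-separated list of x@y.com items."""
    return EMAIL_LIST.fullmatch(value) is not None


if __name__ == "__main__":
    samples = [
        "alice@example.com;bob@example.com",
        "sample",
        "alice@example.com",
        "alice@example.com;bob@example.,carol@example.com",
    ]
    for s in samples:
        print(s, "->", "Valid" if is_valid_list(s) else "Invalid")
```

Because each part of an address cannot contain a delimiter, a value with a space after ; or a , between addresses fails the match instead of being absorbed into a neighbouring address.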
blackbishop