I am trying to extract text from documents using OCR. Sometimes the OCR inserts a space between characters, which becomes a problem when the text is an email address.
For example: Email:[email protected] ------> [email protected]
I am trying to use regex to solve this issue.
import re
txt = "wldeub 777-29378-88 @@ Email:adil.idris [email protected] dfhdu fdlfkos"
txt1 = "Email:michael [email protected] 777-123-0000"
txt2 = "Email: john_jebrasky@ gmail.com TX, USA"
txt3 = "john_jebrasky @gmail.com TX, USA"
txt4 = "I am proficient in python. geekcoder [email protected] TX, USA"
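# Grab everything from "Email" up to the last "com", then strip the spaces and the prefix
out = re.search(r"Email:?.+com", txt)
print(re.sub(r"Email:", "", re.sub(" ", "", out.group(0))))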
Unfortunately, this is just a hardcoded fix: it relies on the "Email:" prefix and on the address ending in "com". Note that in some cases the word "Email:" is not present before the address. How can I handle text with no "Email" prefix, or text that does not follow any standard pattern?
Ideal output: "[email protected]"
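
One direction I have been considering, as a rough sketch rather than a tested solution: match the address itself instead of the "Email:" prefix, and allow stray whitespace on either side of the "@". The name `find_email`, the pattern, and the `example.com` sample strings below are hypothetical and made up for illustration. This only covers a space next to the "@"; a space inside the part before the "@" looks the same as an ordinary word boundary, so it is not handled here.

```python
import re

# Hypothetical pattern: optional "Email:" prefix, local part,
# "@" with optional OCR spaces around it, then a dotted domain.
EMAIL_WITH_GAPS = re.compile(
    r"(?:Email\s*:\s*)?"        # optional prefix, not captured
    r"([A-Za-z0-9._%+-]+)"      # local part (no spaces allowed)
    r"\s*@\s*"                  # "@" possibly surrounded by stray spaces
    r"([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,})"  # domain
)

def find_email(text):
    """Return the first address found with spaces around "@" removed, or None."""
    match = EMAIL_WITH_GAPS.search(text)
    if match is None:
        return None
    return f"{match.group(1)}@{match.group(2)}"

# Made-up sample inputs, with and without the "Email:" prefix
print(find_email("Email: jane_doe@ example.com TX, USA"))
print(find_email("jane_doe @example.com TX, USA"))
```

Because the prefix is optional, the same call works whether or not "Email:" appears, and it returns None when the text contains no "@" at all.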