
I have recently been trying to extract text from documents using OCR. Sometimes the OCR inserts a space between characters, which becomes a problem when the text is an email address.

For example: Email:[email protected] ------> [email protected]

I am trying to use regex to solve this issue.

import re
txt = "wldeub 777-29378-88 @@ Email:adil.idris [email protected] dfhdu fdlfkos"
txt1 = "Email:michael [email protected] 777-123-0000"
txt2 = "Email: john_jebrasky@ gmail.com TX, USA"
txt3 = "john_jebrasky @gmail.com TX, USA"
txt4 = "I am proficient in python. geekcoder [email protected] TX, USA"

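# Grab everything from "Email" up to the last "com" in the text
out = re.search(r"Email:?.+com", txt)
# Drop the spaces, then strip the "Email:" prefix
re.sub(r"Email:", "", re.sub(" ", "", out.group(0)))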

Unfortunately, this is just a hardcoded fix. NOTE: in some cases the word Email: might not be present as a prefix to the address. What if there is no Email: prefix, or the text does not follow any standard pattern?

Ideal output: "[email protected]"

  • With regular expressions you cannot distinguish between "character rubbish" and "meaningful content", so `dfhdu` and `com` are treated the same. Provide more text snippets so that a pattern might eventually show. – Jan Feb 28 '21 at 17:17
  • I have updated the question with more examples – Deepak Sharma Feb 28 '21 at 17:48

1 Answer


It's a bit of a difficult problem to solve with only regular expressions. You can try separating it into two steps.

First, get a (very) rough estimate of what might be an email address:

((?:[^\"@:+={}()|\s]+ ?){1,3}\@ ?\w+(?: ?\. ?\w+)+)

where [^\"@:+={}()|\s] is a complete shot in the dark, but these are characters that I doubt will suddenly pop up as false positives in OCR. The pattern tries to match 1 to 3 ({1,3}) blocks of text, each optionally followed by a space ([^...]+ ?), that contain none of the excluded characters and come before an @. It then tries to match a domain name and its extensions (.com, .co.uk), with optional spaces ( ?) around the dots.
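To make the pieces easier to see, here is the same pattern laid out with re.VERBOSE. This is only a sketch: the name ROUGH_EMAIL is made up for illustration, and because verbose mode ignores bare spaces, each optional space is written as [ ]? instead.

```python
import re

# Same pattern as above, split into commented parts.
# ROUGH_EMAIL is a hypothetical name, not something from the question.
ROUGH_EMAIL = re.compile(r"""
    (
      (?:[^\"@:+={}()|\s]+[ ]?){1,3}  # 1 to 3 text blocks, each optionally followed by a space
      \@[ ]?                          # the @ sign, optionally followed by a space
      \w+                             # first part of the domain
      (?:[ ]?\.[ ]?\w+)+              # one or more dot-separated parts, spaces allowed around the dot
    )
""", re.VERBOSE)

match = ROUGH_EMAIL.search("john_jebrasky @gmail.com TX, USA")
print(match.group(1))  # john_jebrasky @gmail.com
```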

Then you can remove all the whitespace from the matched sequences and check whether each one is a valid email address with a proper library or regex (see "How to check for valid email address?").
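Put together, the two steps could look roughly like this. It is a minimal sketch, not a definitive implementation: extract_emails and SIMPLE_EMAIL are hypothetical names, and the validation regex is a deliberately loose stand-in for whatever proper validator you end up using.

```python
import re

# Step 1: rough candidate pattern (the regex from this answer).
CANDIDATE = re.compile(r'((?:[^\"@:+={}()|\s]+ ?){1,3}\@ ?\w+(?: ?\. ?\w+)+)')

# Step 2: loose sanity check. ASSUMPTION: swap in a real validator if you need one.
SIMPLE_EMAIL = re.compile(r'[^@\s]+@[^@\s]+\.[A-Za-z]{2,}')


def extract_emails(text):
    """Return whitespace-stripped candidates that pass the loose check."""
    emails = []
    for match in CANDIDATE.finditer(text):
        # Remove every space the OCR may have inserted inside the match.
        candidate = re.sub(r'\s+', '', match.group(1))
        if SIMPLE_EMAIL.fullmatch(candidate):
            emails.append(candidate)
    return emails


print(extract_emails("Email: john_jebrasky@ gmail.com TX, USA"))
# ['john_jebrasky@gmail.com']
```

Keep in mind that the first step can grab up to three words before the @, so ordinary words that happen to precede the address may get glued onto it. The validation step will not catch that, since the result still looks like a well-formed address.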

Not the cleanest solution, but I hope it helps.

Edit

I see that you're using a capturing group now, which might explain why it didn't work for you if you tried it. It should be fixed now.

RegexR example

cochaviz