Is there a way to parse an email address that's not separated from the rest of the text?

Question

I'm doing an exercise in which I have to create a program that takes the input of a clipboard copy, parses its contents, and returns a list (in the non-Python sense) of the email addresses contained within.

The source file for said input is a sample public domain PDF that has the following layout:

It looks simple enough, except when I copy/paste that input normally (i.e. without using my program), I get the following output:

Kasey [email protected] [email protected] [email protected] [email protected] [email protected]

You see where the problem lies: the surname is stuck to the beginning of the email address, and thus my program wouldn't be able to parse the addresses correctly.

Would there be a way, regex or otherwise, to somehow separate these during parsing, or is there nothing to do short of doing it by hand or reformatting the file?

So far, my regex looks like this:

import re

email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+             # name part
    @                           # @
    [a-zA-Z0-9_.+]+\.\w{2,3}    # domain name part
''', re.VERBOSE)
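
To make the problem concrete, here is a minimal sketch that runs the pattern above on a made-up string with the same shape as the pasted text. The names and addresses (`Jane Doe`, `example.com`, and so on) are invented for illustration and do not come from the PDF.

```python
import re

email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+             # name part
    @                           # @
    [a-zA-Z0-9_.+]+\.\w{2,3}    # domain name part
''', re.VERBOSE)

# Hypothetical input: each surname is glued to the start of the address.
pasted = "Jane Doejdoe42@example.com John Smithjsmith@example.org"

matches = email_regex.findall(pasted)
print(matches)  # ['Doejdoe42@example.com', 'Smithjsmith@example.org']
```

Each match still carries the surname, because nothing in the text marks where the surname ends and the local part of the address begins.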

I am afraid regex won't help, your email usernames are not following any generic pattern wrt names. — Wiktor Stribiżew, Mar 01 '21 at 10:30
Is there a way to turn something like `"Kasey [email protected]"` into `"[email protected]"` automatically? No, there isn't, forget it. But with a tool that can read the PDF *structure* instead of trying to dissect the mess that copy&paste creates, you could have more luck. For example, give [pdfminer](https://pypi.org/project/pdfminer/) a spin and see how far it gets you with your files ([see](https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python)). — Tomalak, Mar 01 '21 at 10:31
You probably wanna have a look at this post: https://stackoverflow.com/questions/55139685/how-to-extract-email-from-pdf — cochaviz, Mar 01 '21 at 11:26
Does this answer your question? [how to extract email from pdf](https://stackoverflow.com/questions/55139685/how-to-extract-email-from-pdf) — Ryszard Czech, Mar 01 '21 at 21:27
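
The pdfminer suggestion in the comments above could be sketched roughly as follows. This assumes the pdfminer.six package, whose `pdfminer.high_level.extract_text` function returns a PDF's text as a string. The helper names `find_emails` and `emails_from_pdf` and the path `sample.pdf` are hypothetical, and whether the extracted text actually keeps the surname column apart from the email column depends on how the PDF was produced.

```python
import re

# Same character classes as the pattern in the question, on a single line.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+\.\w{2,3}')


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


def emails_from_pdf(path):
    """Extract the text of the PDF at path and collect the addresses in it."""
    # Imported here so find_emails() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return find_emails(extract_text(path))


# Usage (placeholder path, not a file from the question):
# print(emails_from_pdf("sample.pdf"))
```

If the extracted text puts whitespace or a line break between the surname and the address, the pattern picks up the address alone; if the two are still glued together, the result is no better than with copy and paste.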

Leonardo Scotti · Answer 1 · 2021-03-01T13:14:40.943

Pattern

import re

sample = 'Kasey [email protected] [email protected] [email protected] [email protected] [email protected]'

# Capture the local part, the provider, and the top-level domain separately
pattern = r'(?:([a-zA-Z0-9_.]+)@([a-z]+)\.([a-z]{2,5}))'
result = [{"name": x, "provider": y, "domain": z} for x, y, z in re.findall(pattern, sample)]

output:

[{'name': 'Mcbridemcbrid17', 'provider': 'gmail', 'domain': 'com'},
{'name': 'Cohencohe1696', 'provider': 'yahoo', 'domain': 'com'},
{'name': 'Waltonhwalton3', 'provider': 'hotmail', 'domain': 'com'},
{'name': 'Deanjacquesd', 'provider': 'att', 'domain': 'net'},
{'name': 'Clevelandncleveland88', 'provider': 'mac', 'domain': 'com'}]
