I'm doing an exercise in which I have to write a program that takes text copied to the clipboard, parses it, and returns a list (in the non-Python sense) of the email addresses it contains.
The source for that input is a sample public-domain PDF that has the following layout:
It looks simple enough, except that when I copy/paste that input normally (i.e. without using my program), I get the following output:
Kasey [email protected] [email protected] [email protected] [email protected] [email protected]
You see where the problem lies: the surname is stuck to the beginning of the email address, and thus my program wouldn't be able to parse the addresses correctly.
Is there a way, regex or otherwise, to separate these during parsing, or is there nothing to do short of fixing it by hand or reformatting the file?
So far, my regex looks like this:
import re

email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+            # name part
    @                          # @
    [a-zA-Z0-9_.+]+\.\w{2,3}   # domain name part
''', re.VERBOSE)
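
To make the problem concrete, here is a minimal sketch that runs this pattern over a made-up pasted line. The name and the example.com address are hypothetical stand-ins, not data from the PDF, and the text is assumed to already be in a string rather than read from the clipboard.

```python
import re

# Same pattern as in the question, repeated so the snippet runs on its own.
email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+            # name part
    @                          # @
    [a-zA-Z0-9_.+]+\.\w{2,3}   # domain name part
''', re.VERBOSE)

# Hypothetical sample: a first name, then a surname glued to the address,
# mimicking what the copy/paste produces.
pasted = "Kasey Smithkasey.smith@example.com"

matches = email_regex.findall(pasted)
print(matches)  # ['Smithkasey.smith@example.com']
```

The surname "Smith" ends up inside the match because it consists only of characters the name part accepts, so the pattern alone cannot tell where the surname ends and the address begins.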