This is the regular expression I have formed so far:
/(?:("?(?:.*)"?)\s*)?\s<(.*@.*)>|(?:mailto:(.*@.*))|(.*@.*)/gi
You can check it out at regex101.
I'm trying to extract 'Name' & 'Email' from the following:
John Smith <[email protected]>
John Smith <[email protected]>
"John Smith" <[email protected]>
"John" <[email protected]>
John Smith<[email protected]>
<[email protected]>
[email protected]
mailto:[email protected]
"John"<[email protected]>
To: John Smith <[email protected]>
From: John Smith <[email protected]>
Reply-to: [email protected]
Return-path: <[email protected]>
Message-id: <[email protected]>
References: <[email protected]>
Original-recipient: rfc822;[email protected]
for [email protected]
ESMTPSA id <[email protected]>
domain of [email protected]
[email protected]
(ORCPT [email protected])
Having started from scratch, I feel I'm almost there, but I'm having trouble with three things:
1. Stripping double quotes from the first capturing group.
2. Dealing with the variant where the whitespace is missing:
   John Smith<[email protected]>
3. False positives in the 'Name' field for the latter block, so I need a way of excluding these (perhaps using the preceding ':', ':', '=', 'for', 'id', 'of'?).
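To make the three problems concrete, here is a minimal sketch of one possible approach, written in Python rather than grep or sed so that it is easy to run. Everything named here (ADDR, PATTERN, STOP_WORDS, extract) is hypothetical, the example.com addresses are placeholders standing in for the samples above, and the stop-word list is an assumption taken from the tokens listed in point 3. It is not a full RFC 5322 parser.

```python
import re

# Simplified address: no whitespace, brackets, quotes or separator characters
# on either side of the "@". This is an assumption, not an RFC 5322 grammar.
ADDR = r'[^\s<>",;:=()]+@[^\s<>",;:=()]+'

PATTERN = re.compile(
    # Form 1: optional (optionally quoted) name, optional whitespace, <address>.
    # The quotes sit outside the capturing group, so they are stripped.
    # Excluding ":" and "=" from the name keeps header labels such as "To:" out.
    r'"?(?P<name>[^"<>:=]*?)"?\s*<(?P<bracketed>' + ADDR + r')>'
    # Form 2: a bare address, e.g. after "mailto:", "for " or "rfc822;".
    r'|(?P<bare>' + ADDR + r')'
)

# Hypothetical keywords: a "name" ending in one of these is treated as a false positive.
STOP_WORDS = {"for", "id", "of"}


def extract(line):
    """Return (name or None, email) for the first address in line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    name = (m.group("name") or "").strip()
    words = name.split()
    if words and words[-1].lower() in STOP_WORDS:
        name = ""  # e.g. "ESMTPSA id <...>" is not a person's name
    email = m.group("bracketed") or m.group("bare")
    return (name or None, email)


if __name__ == "__main__":
    samples = [
        '"John Smith" <john@example.com>',
        'John Smith<john@example.com>',
        'To: John Smith <john@example.com>',
        'mailto:john@example.com',
        'ESMTPSA id <abc123@mail.example.com>',
    ]
    for line in samples:
        print(line, "->", extract(line))
```

The stop-word check is done after matching instead of inside the pattern, which keeps the regular expression short. One known limitation of this sketch is that quoted names containing ':' or '=' are not captured in full.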
As a complete regular expression novice, I would appreciate a little direction from someone knowledgeable on how I might overcome these issues.
For the curious, I've unfortunately lost my CardDAV and thus all contacts, so in true Linux fashion, I'm going to rebuild a list of emails by manually parsing my entire raw MBOX, sorting by most common, and going from there.
I will be using bash grep, or perl sed.
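For reference, the "sort by most common" step could be sketched as below. This is only an illustration in Python, not the grep/sed pipeline mentioned above. The names ADDR and count_addresses are hypothetical, the address pattern is a deliberately loose assumption, and the example.com addresses are placeholders.

```python
import re
from collections import Counter

# Loose address pattern (an assumption): anything around "@" that contains
# no whitespace, brackets, quotes or separator characters.
ADDR = re.compile(r'[^\s<>",;:=()]+@[^\s<>",;:=()]+')


def count_addresses(lines):
    """Count every address found in lines, case-insensitively, most common first."""
    counts = Counter()
    for line in lines:
        for addr in ADDR.findall(line):
            counts[addr.lower()] += 1
    return counts.most_common()


if __name__ == "__main__":
    demo = [
        "From: John Smith <john@example.com>",
        "To: jane@example.com",
        "Reply-to: JOHN@example.com",
    ]
    for addr, n in count_addresses(demo):
        print(n, addr)
```

For a real MBOX the lines would come from iterating over the open file object, which avoids loading the whole file into memory.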
Thank you for your time!