I am trying to get all the unique emails from an HTML page into an array. The file is HUGE, and there is no consistent pattern around the emails that I could use to pick them out.
Here is an example HTML file called GetEmails.html. The actual file will have CSS and much more code to sift through. In this example, notice the different ways the emails appear: not all are separated by spaces; some are separated by commas, semicolons, etc.
<html>
<body>
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong>
</p>
<p><u>There will be pages and pages and pages of text to sift thru so get the emails into an array.</u></p>
<p>This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong> and repeat This is some text and here is an email [email protected] and in this text we will see lots of emails like [email protected]; [email protected], [email protected] or even dot orgs too like [email protected] and all types such as [email protected],[email protected] and even [email protected] some might be bold [email protected] and some will look like this Email:<strong>[email protected]</strong></p>
<p> </p>
</body>
</html>
I thought about using explode on spaces, but that might not work and might use too many resources. I am just wondering if there is a simple function in PHP to help me get all the emails into an array. Here is what I tried.
<?php
$TheEmails = array();
$lines = file('GetEmails.html');
foreach ($lines as $line_num => $line) {
    // Checks whether the line contains an email.
    if (preg_match('/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i', $line)) {
        // Splits the tag-stripped line on spaces
        $words = explode(" ", strip_tags($line));
        // Keeps only the items that contain an @ sign
        $fl_array = preg_grep("/@/", $words);
        // Adds each trimmed item to the list of emails
        // (trim() takes a string, so it is applied per item, not to the array)
        foreach ($fl_array as $email) {
            $TheEmails[] = trim($email);
        }
    }
}
// Keeps only the unique emails
$UniqueEmails = array_unique($TheEmails);
?>
The code above works; however, I am afraid that with the HUGE file I will use, it is using resources unnecessarily. Also, it will not account for emails separated by commas, like this: [email protected],[email protected]
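One possible direction for the comma case is to skip the explode step and let the regular expression pull every match out of the text directly. The following is only a minimal sketch, assuming `preg_match_all` with the same pattern as in the attempt above. The function name `extract_unique_emails` and the sample string are made up for illustration, lowercasing before de-duplication is an assumption about what "unique" should mean, and the pattern is a rough approximation rather than a full email validator.

```php
<?php
// Hypothetical helper: returns the unique email-like strings found in $text.
function extract_unique_emails(string $text): array
{
    // Same pattern as in the attempt above; the i flag makes it case-insensitive.
    $pattern = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i';

    // preg_match_all collects every match; $matches[0] holds the full matches.
    preg_match_all($pattern, $text, $matches);

    // Assumption: addresses that differ only in letter case count as duplicates.
    $emails = array_map('strtolower', $matches[0]);

    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($emails));
}

// Made-up sample with comma, semicolon, and tag separators.
$html = 'Mail a@example.com,b@example.org; <strong>A@example.com</strong>';
print_r(extract_unique_emails($html));
```

Because the pattern itself decides where an address starts and ends, commas, semicolons, and surrounding tags should not need any special handling.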
Any ideas on the best way to do this? At the very least, it would be VERY VERY helpful to learn the best way to do this, even if I can only get the emails that are separated by spaces.
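On the resource worry, one option that may help is to read the file one line at a time with `fgets` instead of loading every line up front with `file()`, and to use array keys to drop duplicates as they are found. This is a hedged sketch, not a measured solution: the function name `collect_emails_from_file` is hypothetical, and it assumes that no email address is split across a line break.

```php
<?php
// Hypothetical helper: scans a file line by line and returns unique email-like strings.
function collect_emails_from_file(string $path): array
{
    $pattern = '/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i';
    $seen = array();

    $handle = fopen($path, 'r');
    if ($handle === false) {
        // The file could not be opened; return an empty list.
        return array();
    }

    // fgets returns one line per call, so only the current line is held in memory.
    while (($line = fgets($handle)) !== false) {
        if (preg_match_all($pattern, $line, $matches)) {
            foreach ($matches[0] as $email) {
                // Using the lowercased email as the array key removes duplicates as we go.
                $seen[strtolower($email)] = true;
            }
        }
    }
    fclose($handle);

    return array_keys($seen);
}
```

With the example file from the question, it could be called as `$UniqueEmails = collect_emails_from_file('GetEmails.html');`.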
Hope this makes sense. Thanks so much!