I'm trying to scrape email addresses from a page and am having trouble getting the parent element that contains the email's '@' symbol. The emails are embedded within different element tags, so I can't just pick them out by tag name. There are about 50,000 pages that I have to go through.
url = 'https://sec.report/Document/0001078782-20-000134/#f10k123119_ex10z22.htm'
Here are some examples (a couple are from different pages I have to scrape):
<div style="border-bottom:1px solid #000000">[email protected]</div>
<div class="f3c-8"><u>[email protected]</u></div>
<p style="margin-bottom:0pt;margin-top:0pt;;text-indent:0pt;;font-family:Arial;font-size:11pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Email: [email protected]; Phone: 858-320-8244</p>
<td class="f8c-43">E-mail: <u>[email protected]</u></td>
<p class="f7c-4">Email: [email protected]</p>
What I have tried:
- I tried find_all('div') to get a ResultSet of all the divs, then kept the ones whose text contains the '@' symbol.
div = page.find_all('div')
for each in div:
    # .text includes the text of every descendant, not just this div
    if '@' in each.text:
        print(each.text)
Because the page body is itself wrapped in a 'div', this printed the whole page. Fail. The emails are also embedded within different tags, so searching one tag name at a time seems inefficient.
- Using a regular expression. I tried a regular expression to pick out the emails, but it also captures a lot of surrounding markup that isn't usable, which I would have to manually split up, strip characters from, etc. Handling all the different scenarios this way seemed like a daunting task.
import re
# raw string so that \S is passed to the regex engine unchanged
emails = re.findall(r'\S+@\S+', str(page))
for each in emails:
    print(each)
Doing this gave me something like this:
hidden;}@media
#000000">[email protected]</div>
#000000">[email protected]
#000000">[email protected]</div>
#000000">[email protected]</div>
#000000">[email protected]</div></p>
#000000">[email protected]</div></p>
[email protected])</div>
#000000">[email protected]</div>.
href="http://@umich.edu">@umich.edu</a></li><li><a
Now I could go in and split some of the text using .split('<') and then split again, etc., but the matches are not all the same shape, and with 50,000+ pages at 100 entries each, there is a lot to scrape and take into consideration.
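For comparison, here is a minimal sketch of a narrower pattern than '\S+@\S+'. It requires at least one address-like character before the '@' and a dotted domain after it, so markup fragments such as '}@media' or '">@umich.edu' do not match. The pattern, the helper name extract_emails, and the sample string are all assumptions made for illustration. The pattern is a rough approximation rather than a complete email grammar, and the sample addresses are placeholders, not data from the real pages.

```python
import re

# Simplified pattern: local part, '@', then a domain with at least one dot.
# This is an approximation, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text):
    """Return the unique addresses found in text, in order of appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


# Placeholder input shaped like the noisy output shown above.
sample = (
    'hidden;}@media #000000">alice@example.com</div> '
    '(bob@example.org)</div>. href="http://@umich.edu">@umich.edu</a>'
)

emails = extract_emails(sample)
print(emails)
```

With BeautifulSoup, the same helper could presumably be fed page.get_text(' ') instead of str(page), so that tags and attributes never reach the regex in the first place.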
I looked on Google and Stack Overflow, but all I could find were solutions for getting the text within a specific, known element.
What I need specifically is how to find the parent element that contains an email.
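To make the goal concrete, here is a minimal sketch of that kind of search, assuming BeautifulSoup 4 (find_all with the string argument). It matches text nodes rather than tags and then takes each node's .parent, which is the innermost tag that directly holds the text, so an outer 'div' wrapping the whole page is not returned. The sample markup and addresses are made up for illustration, and I have not verified this against all of the real pages.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup; the addresses are placeholders, not from the real page.
html = """
<div id="body">
  <div style="border-bottom:1px solid #000000">alice@example.com</div>
  <div class="f3c-8"><u>bob@example.org</u></div>
  <p class="f7c-4">Email: carol@example.net</p>
  <p>No address here</p>
</div>
"""

page = BeautifulSoup(html, "html.parser")

# Search text nodes (NavigableString objects) instead of tags.
text_nodes = page.find_all(string=re.compile("@"))

# .parent is the innermost tag that directly contains the text node.
# <style>/<script> are skipped because CSS such as '@media' also contains '@'.
parents = [
    node.parent
    for node in text_nodes
    if node.parent.name not in ("style", "script")
]

for tag in parents:
    print(tag.name, "->", tag.get_text(strip=True))
```

Checking for a bare '@' is deliberately loose here. Any text containing the symbol will match, so the parent's text would still need to be validated as an actual address.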
I don't think I need Selenium for this, since the problem would be the same as with BeautifulSoup and the site is not JavaScript-rendered. Some of the pages are PDFs, which is a whole other issue.
Any insight, help or advice is appreciated. Thanks.