I'm trying to use a regex in Scrapy to find all email addresses on a page.
I'm using this code:
item["email"] = re.findall(r'[\w\.-]+@[\w\.-]+', response.body)
This works almost perfectly: it grabs all the emails and gives them to me. However, I don't want any repeats in the result, even if the same email address appears on the page more than once.
I'm getting responses like this (which is correct):
{'email': ['[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]']}
However, I want to show only the unique addresses, which would be:
{'email': ['[email protected]',
'[email protected]',
'[email protected]']}
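For reference, here is a minimal sketch of the kind of result I'm after, assuming the order of first appearance should be kept and Python 3.7+ (where dicts keep insertion order). The helper name `unique_emails` and the sample string are made up for illustration; the pattern is the same one as above.

```python
import re

# Same pattern as in the question, written as a raw string.
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+')

def unique_emails(text):
    """Return matched addresses in first-seen order, without repeats."""
    matches = EMAIL_RE.findall(text)
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(matches))

# Hypothetical sample text, not taken from a real page.
sample = "a@example.com, b@example.com and a@example.com again"
print(unique_emails(sample))
```

In the spider this would presumably become something like `item["email"] = unique_emails(response.text)`, assuming the response is a text response. If order doesn't matter, `list(set(matches))` would also drop the repeats.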
If you can also show how to collect only the email addresses and not a match like
'[email protected]'
that would be helpful too.
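A rough sketch of what I mean by filtering, under the assumption that the unwanted matches are asset filenames such as `logo@2x.png` (a hypothetical example). The stricter pattern, the suffix list, and the name `likely_emails` are all illustrative guesses, not a complete email validator.

```python
import re

# Stricter than the original pattern: the domain must end in a dot
# followed by at least two letters.
EMAIL_RE = re.compile(r'[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}')

# Assumed list of file extensions that indicate a filename, not an email.
ASSET_SUFFIXES = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp')

def likely_emails(text):
    """Return unique matches in first-seen order, skipping asset filenames."""
    seen = {}
    for candidate in EMAIL_RE.findall(text):
        if candidate.lower().endswith(ASSET_SUFFIXES):
            continue  # looks like an image filename such as logo@2x.png
        seen.setdefault(candidate, None)
    return list(seen)

# Hypothetical sample text, not taken from a real page.
sample = "Contact a@example.com or see logo@2x.png, then a@example.com again."
print(likely_emails(sample))
```

The suffix list would need adjusting to whatever false positives actually show up on the pages being scraped.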
Thanks everyone!