Python Forum

Full Version: Email extraction from websites
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
(Aug-18-2017, 09:32 AM)wavic Wrote: [ -> ]If you have an Excel document you can get all you want with Pandas or some of the xls modules - Openpyxl for example. Can't provide a code example because I never needed to use this.

I thought it was a common activity to extract emails from sites, so to find the code written on the internet. Thank you for your suggestion...
You can use regex: http://emailregex.com/

But if you want to open a regular Excel file, you've formatting and maybe binary data inside. The newer format is based on xml, the older Excel format is something else. You can use the hammer method and parse the whole file for e-mail addresses.

import re


email_regex = r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"

with open('youfile.xls') as fd:
    emails = set(re.findall(email_regex, fd.read()))

print(emails)
Maybe you have luck and the text is encoded as UTF-8.
But it's better to use a Library to open Excel files or you export the Excel sheet as CSV and use the stdlib of Python for this task.
(Aug-18-2017, 11:34 AM)DeaD_EyE Wrote: [ -> ]You can use regex: http://emailregex.com/

But if you want to open a regular Excel file, you've formatting and maybe binary data inside. The newer format is based on xml, the older Excel format is something else. You can use the hammer method and parse the whole file for e-mail addresses.

import re


email_regex = r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"

with open('youfile.xls') as fd:
    emails = set(re.findall(email_regex, fd.read()))

print(emails)
Maybe you have luck and the text is encoded as UTF-8.
But it's better to use a Library to open Excel files or you export the Excel sheet as CSV and use the stdlib of Python for this task.

Thank you for the reply
I do not want to extract emails from a file. The excel file contains urls in a column and I was hoping that is possible to exctract all the emails in the site beside each cell.
I wrote excel because I use it frequently .... what I'm interested of all emails extracted from the links.
(Aug-18-2017, 07:46 PM)stefanoste78 Wrote: [ -> ]I do not want to extract emails from a file. The excel file contains urls in a column and I was hoping that is possible to exctract all the emails in the site beside each cell.
You read in Excel with eg Openpyxl, Pandas.
Cell with url address you give to Requests.
Then is web scraping i have a tutorial here.

Email links in html has a unique look that it begins with mailto:.
Then can use a CSS selector like a[href^="mailto"]
Example:
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <body>
    <p>Email me at <a class="emaillink" href="mailto:[email protected]">[email protected]</a></p>
    <p>Email<a id='foo' href="mailto:[email protected]">[email protected]</a></p>
  </body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
email = soup.select('a[href^="mailto"]')
for link in email:
    print(link.text)
Output:
[email protected] [email protected]
(Aug-18-2017, 08:55 PM)snippsat Wrote: [ -> ]
(Aug-18-2017, 07:46 PM)stefanoste78 Wrote: [ -> ]I do not want to extract emails from a file. The excel file contains urls in a column and I was hoping that is possible to exctract all the emails in the site beside each cell.
You read in Excel with eg Openpyxl, Pandas.
Cell with url address you give to Requests.
Then is web scraping i have a tutorial here.

Email links in html has a unique look that it begins with mailto:.
Then can use a CSS selector like a[href^="mailto"]
Example:
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<html>
  <body>
    <p>Email me at <a class="emaillink" href="mailto:[email protected]">[email protected]</a></p>
    <p>Email<a id='foo' href="mailto:[email protected]">[email protected]</a></p>
  </body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
email = soup.select('a[href^="mailto"]')
for link in email:
    print(link.text)
Output:
[email protected] [email protected]

Hello. I noticed that you've already dealt with the case of extracting emails from the links.
The problem is I'm not a programmer and I could not put into practice what you said in the affiliates you posted.
Have you ever done something like where the code fetches excel data? For me it would be easier. I just need to put the links in a column and then activate the extraction.
Pages: 1 2