Python Forum
Scrape for html based on url string and output into csv
#1
Crawl an email from specified website.

I have a list of specific company registration codes in CSV format, updated on a weekly basis.

I want to crawl the email address of each of those companies from the source website and write the addresses to a new CSV file.

The source addresses containing the emails to be crawled look like this:
http://www.somesite.com/result?country=en&q=1232498 (the "q" value is the variable company registration code; each code leads to a different page containing the email).

Each registration code to crawl is located in the csv file (second column, with header "regcode").
(Source table structure: compname | regcode | othercol1 | othercol2; columns are separated by semicolons.)

The email that needs to be crawled is located between these HTML tags on each page:
Output:
<table class="table-info">
<tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr>
<tr>
    <td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div></td>
</tr>
</table>
The crawled email should be put into a new csv file called extracted.csv.

The extracted.csv table structure should be as following:
regcode | email

Explanation: the same company registration code that is used as the crawl string should be written to the new csv file alongside the crawled email address.

This process should be triggered every week, and the automation should look only for new entries that have been added to the csv file.
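The requirements above could be sketched end-to-end roughly like this (a minimal sketch, not a definitive implementation: the URL pattern, selectors, and file names follow this post, the page structure is assumed to match the snippet shown, and `str.removeprefix` needs Python 3.9+):

```python
import csv
import os

import requests
from bs4 import BeautifulSoup

# URL pattern taken from this post; the real site is an assumption
BASE_URL = 'http://www.somesite.com/result?country=en&q={}'


def extract_email(html):
    """Return the address from the mailto: link inside <table class="table-info">, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.select_one('.table-info')
    if table is None:
        return None
    link = table.select_one('.col-2 a[href^="mailto:"]')
    if link is None:
        return None
    return link.get('href').removeprefix('mailto:')


def already_done(path='extracted.csv'):
    """Regcodes written in a previous weekly run, so only new entries are crawled."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf8') as f:
        return {row['regcode'] for row in csv.DictReader(f, delimiter=';')}


def run(source='data.csv', target='extracted.csv'):
    done = already_done(target)
    write_header = not os.path.exists(target)
    with open(source, encoding='utf8') as src, \
            open(target, 'a', newline='', encoding='utf8') as out:
        writer = csv.writer(out, delimiter=';')
        if write_header:
            writer.writerow(['regcode', 'email'])
        for row in csv.DictReader(src, delimiter=';'):
            code = row['regcode']
            if code in done:
                continue  # skip entries handled in an earlier run
            html = requests.get(BASE_URL.format(code)).text
            email = extract_email(html)
            if email:
                writer.writerow([code, email])


if __name__ == '__main__':
    run()
```

Appending to extracted.csv and skipping regcodes already present is one simple way to satisfy the "new entries only" weekly requirement; a scheduler (cron, Task Scheduler) would trigger the script.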
#2
So, have you tried something?
The task is manageable with basic Python skills and a look at the tools needed: Requests, BeautifulSoup (bs4), lxml, and the csv module.
Look at Web-Scraping part-1.
Quick hint:
from bs4 import BeautifulSoup

html = '''\
<table class="table-info">
<tr>
    <td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'lxml')
>>> mail = soup.select_one('.col-2')
>>> mail
<td class="col-2"><div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div></td>
>>> mail.select_one('a').get('href')
'mailto:[email protected]'
#3
Hello.

Thanks for the quick hint :)

I think I need to use Scrapy, because the csv file contains over 100K rows of data / companies, which means over 100K web requests.

I am very new to this, so any help is highly appreciated!

Thanks :)
#4
(Jan-11-2021, 12:19 AM)dana Wrote: I think I need to use Scrapy, because the csv file contains over 100K rows of data / companies and that means over 100K web requests.
Scrapy could possibly be used for this.
I would start with a smaller test file and just use basic tools as shown, e.g. BeautifulSoup with lxml (a very fast parser, C speed).
Then see how long it takes on the sample file.
You can also look at a post where I use concurrent.futures to speed it up.

Look at this Post for splitting the csv with Pandas, for use in Scrapy.
The chunked csv from Pandas can also be used with the method I talked about.
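The concurrent.futures approach mentioned above could look roughly like this (a sketch under assumptions: the URL pattern is the one from the first post, and `data.csv` has the semicolon-separated layout described there). Threads fit because the work is I/O-bound, with most time spent waiting on the network:

```python
import csv
from concurrent.futures import ThreadPoolExecutor

import requests


def build_url(regcode):
    # URL pattern taken from the first post in this thread
    return f'http://www.somesite.com/result?country=en&q={regcode}'


def fetch(regcode):
    # One request per registration code; return the code together with the HTML
    return regcode, requests.get(build_url(regcode), timeout=10).text


def fetch_all(regcodes, workers=20):
    # executor.map preserves input order and runs up to `workers` requests at once
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch, regcodes))


if __name__ == '__main__':
    with open('data.csv', encoding='utf8') as f:
        codes = [row['regcode'] for row in csv.DictReader(f, delimiter=';')]
    for code, html in fetch_all(codes[:50]):  # start with a small slice, as advised
        print(code, len(html))
```

With 100K+ requests, a polite delay or a lower worker count may be needed to avoid hammering the target server.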
#5
So, I started to read the csv file to get the data like so:

import csv

with open('data.csv', encoding='utf8') as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=';')

    for row in csv_reader:
        print(row['regcode'])
Now, I am clueless about how to loop the csv rows into the request url parameter q.

eg. http://www.somesite.com/result?country=en&q=123456789



#6
(Jan-11-2021, 11:49 PM)dana Wrote: Now, I am clueless about how to loop the csv rows into the request url parameter q.
Could you post a sample of the .csv file?
See if this helps.
data.csv:
name;regcode;number
Salah;111;22
ali;222;33
Ranard;333;44
import csv

with open('data.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)  # skip the header row
    for row in reader:
        #print(row[1])
        url = f'http://www.somesite.com/result?country=en&q={row[1]}'
        print(url) 
Output:
http://www.somesite.com/result?country=en&q=111
http://www.somesite.com/result?country=en&q=222
http://www.somesite.com/result?country=en&q=333
#7
Great,
so far I understand it now. Thanks.

How can I get the scraped email without the "mailto:" prefix?

The e-mail address is located where I referenced it in the first post.

<table class="table-info">
<tr>
    <td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div></td>
</tr>
</table>
So, it should look inside this path:
1. find the table with class "table-info"
2. find the div with class "col-2-text"
3. find the hyperlink with mailto:
4. extract the clean email only, without "mailto:" or the text inside the a tags

Any ideas?
Thanks!
#8
When you get a string back (from get('href')), there is nothing more the parser can do;
then use normal Python string methods or regex.
A simple split(':') is all that's needed.
>>> table_info = soup.select_one('.table-info')
>>> mail = table_info.select_one('.col-2 a')
>>> mail = mail.get('href')
>>> mail
'mailto:[email protected]'
>>> mail_clean = mail.split(':')[1]
>>> mail_clean
'[email protected]'
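An alternative to split(':') is str.removeprefix (Python 3.9+), which strips only the literal 'mailto:' prefix and so cannot mis-split an href that happens to contain further colons:

```python
mail = 'mailto:[email protected]'
# removeprefix drops the leading 'mailto:' and leaves the rest untouched
print(mail.removeprefix('mailto:'))  # [email protected]
```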
#9
I put it to the test on a live url, but I am missing something; it shows an error:


from bs4 import BeautifulSoup
from requests import get
 
page = "http://py123.epizy.com/index.html"
content = get(page).content
soup = BeautifulSoup(content, "lxml")

table_info = soup.select_one('.table-info')
mail = table_info.select_one('.col-2 a')
mail = mail.get('href')
mail_clean = mail.split(':')[1]
print(mail_clean)
Output:
File "C:\Users\pc\Desktop\python\test.py", line 9, in <module>
    mail = table_info.select_one('.col-2 a')
AttributeError: 'NoneType' object has no attribute 'select_one'
#10
Look at the content you get back, e.g. print(soup):
Output:
<noscript>This site requires Javascript to work,.....
So the host runs a Javascript check (which makes this more difficult) that may not be needed for the real task.
Usually, when a site uses a lot of Javascript, you can use Selenium.

As this is just a test-server check that may not apply to the real task, you can bypass it by passing in the cookie.
from bs4 import BeautifulSoup
from requests import get

page = "http://py123.epizy.com/index.html"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
content = get(page, headers=headers,  cookies=cookies).content
soup = BeautifulSoup(content, "lxml")

table_info = soup.select_one('.table-info')
mail = table_info.select_one('.col-2 a')
mail = mail.get('href')
mail_clean = mail.split(':')[1]
print(mail_clean)
Output:
[email protected]