Python Forum
Scrape for html based on url string and output into csv
#11
So far you have helped me put together the code below, thank you for that.

import csv
import requests
import datetime
import time

from requests import get
from bs4 import BeautifulSoup


with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)

    count = 0
    
    for row in reader:
        
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
        
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")

        table_info = soup.select_one('.table-info')

        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]

        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
        
        collected_data = row[1], mail_clean, website, timestamp

        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)

        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1

        
The code sort of works, but there is a problem with the CSV writing part.
I am not sure how to write a new line to the CSV without overwriting the same line over and over again. I messed up the loop and I am not sure how to fix it.

The second thing is that it needs to crawl only new entries in the future (every week), so I think it also needs to check extracted.csv each time, to avoid duplicates, before it writes a new line to extracted.csv.

I hope you can give me a hint.
Thank you, buddy.
#12
I am posting the entire table structure here to show exactly what I am trying to scrape.

I want to extract the phone, email, website, and main activity (the li element text without the div).

UPDATE: I forgot to mention that I ran into an error: sometimes no email or website is available (or only one of them), and the code cannot handle that and breaks the entire loop. I think there should be some error handling.

<table class="table-info">
    <tbody>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Business name</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">Company XYZ&nbsp;</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Register code:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">112233558</div>
            </td>
        </tr>


        <tr>
            <td class="col-1">
                <div class="col-1-text">Operating address:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                        class="link-location">Some location strt. 233</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Legal address</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">
                    <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                        location
                    </a>
                </div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">VAT No:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                        liability</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Age:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">1 year&nbsp;3 months</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Founded:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">20/09/2019</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Capital:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">2000 USD</div>
            </td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Phone:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">123456789</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">E-mail:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div>
            </td>
        </tr>
 <tr>
        <td class="col-1"><div class="col-1-text">Website:</div></td>
        <td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
    </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Representatives:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">
                    <div class="box-message">
                        <p class="desc">To access information, please</p>
                        <p>
                            <a href="#" onclick="return loginClicked(this, '#');"
                                class="btn btn-small btn-purple link-login">Log in</a>
                        </p>
                    </div>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">
                    Main activity:
                    <span class="tip info" title=""
                        data-original-title="Activities are classified according to EMTAK 2008"></span>
                </div>
            </td>
            <td class="col-2">
                <div class="col-2-text" id="activity_top5ffe2eab23d13">
                    <ul>
                        <li>
                            Computer consultancy activities
                            <div class="main_activities_top_link_wrapper">
                                <a href="https://www.somesite.com/" target="_blank"
                                    onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                    class="btn btn-simple btn-open-graph">
                                    <span>Open TOP 20</span> </a>
                            </div>
                        </li>
                    </ul>

                </div>
            </td>
        </tr>


    </tbody>
</table>
#13
Anyone?
#14
I have not had much time to look into this further.
You should try to scrape that table yourself; the starting point is the same as I showed before.
Here is a hint on how to loop over the values in the table,
followed by a link to Pandas, which can make this easier, as it can scrape a table and also convert it to CSV with df.to_csv().
from bs4 import BeautifulSoup

html = '''\
<table class="table-info">
    <tbody>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Business name</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">Company XYZ&nbsp;</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Register code:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">112233558</div>
            </td>
        </tr>


        <tr>
            <td class="col-1">
                <div class="col-1-text">Operating address:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location"
                        class="link-location">Some location strt. 233</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Legal address</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">
                    <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some
                        location
                    </a>
                </div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">VAT No:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this, '12345678')">Get VAT
                        liability</a></div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Age:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">1 year&nbsp;3 months</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Founded:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">20/09/2019</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Capital:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">2000 USD</div>
            </td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Phone:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">123456789</div>
            </td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">E-mail:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text"><a href="mailto:[email protected]">[email protected]</a></div>
            </td>
        </tr>
 <tr>
        <td class="col-1"><div class="col-1-text">Website:</div></td>
        <td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
    </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">Representatives:</div>
            </td>
            <td class="col-2">
                <div class="col-2-text">
                    <div class="box-message">
                        <p class="desc">To access information, please</p>
                        <p>
                            <a href="#" onclick="return loginClicked(this, '#');"
                                class="btn btn-small btn-purple link-login">Log in</a>
                        </p>
                    </div>
                </div>
            </td>
        </tr>
        <tr>
            <td colspan="2" class="sep"></td>
        </tr>
        <tr>
            <td class="col-1">
                <div class="col-1-text">
                    Main activity:
                    <span class="tip info" title=""
                        data-original-title="Activities are classified according to EMTAK 2008"></span>
                </div>
            </td>
            <td class="col-2">
                <div class="col-2-text" id="activity_top5ffe2eab23d13">
                    <ul>
                        <li>
                            Computer consultancy activities
                            <div class="main_activities_top_link_wrapper">
                                <a href="https://www.somesite.com/" target="_blank"
                                    onclick="ga('send', 'event', 'check', 'top_btn', 'Anonym');"
                                    class="btn btn-simple btn-open-graph">
                                    <span>Open TOP 20</span> </a>
                            </div>
                        </li>
                    </ul>

                </div>
            </td>
        </tr>


    </tbody>
</table>
'''

soup = BeautifulSoup(html, 'lxml')
>>> table_info = soup.select_one('.table-info')
>>> p1 = table_info.select('.col-1')
>>> p2 = table_info.select('.col-2')
>>> for tag in p1:
...     print(tag.select_one('.col-1-text').text.strip()) 
...     
Business name
Register code:
Operating address:
Legal address
VAT No:
Age:
Founded:
Capital:
Phone:
E-mail:
Website:
Representatives:
Main activity:
>>> 
>>> for tag in p2:
...     print(tag.select_one('.col-2-text').text.strip())    
...     
Company XYZ
112233558
Some location strt. 233
Some
                        location
Get VAT
                        liability
1 year 3 months
20/09/2019
2000 USD
123456789
[email protected]
www.somecompany.com
To access information, please

Log in
Computer consultancy activities
                            

Open TOP 20
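
The two loops above print labels and values separately. A minimal sketch of one way to pair them per row, so that a missing E-mail or Website row gives an empty string instead of an error, and so that the Main activity text is taken from the li without the nested div. The helper names (table_to_dict, cell_text, link_href, main_activity) are made up for illustration, the HTML is a shortened stand-in for the posted table with the E-mail row left out on purpose, and 'html.parser' is used only to avoid the extra lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the posted table (hypothetical sample);
# the E-mail row is omitted on purpose to show the missing-field case.
html = '''
<table class="table-info">
  <tr>
    <td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td>
  </tr>
  <tr><td colspan="2" class="sep"></td></tr>
  <tr>
    <td class="col-1"><div class="col-1-text">Website:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="http://www.somecompany.com" target="_blank">www.somecompany.com</a></div></td>
  </tr>
  <tr>
    <td class="col-1"><div class="col-1-text">Main activity: <span class="tip info"></span></div></td>
    <td class="col-2"><div class="col-2-text"><ul><li>
      Computer consultancy activities
      <div class="main_activities_top_link_wrapper"><a href="https://www.somesite.com/"><span>Open TOP 20</span></a></div>
    </li></ul></div></td>
  </tr>
</table>
'''


def table_to_dict(soup):
    """Map each label (without the trailing colon) to its .col-2-text tag."""
    cells = {}
    for tr in soup.select('.table-info tr'):
        label = tr.select_one('.col-1-text')
        value = tr.select_one('.col-2-text')
        if label is None or value is None:
            continue  # separator rows contain neither cell
        cells[label.get_text(strip=True).rstrip(':')] = value
    return cells


def cell_text(cells, key):
    """Cell text with whitespace collapsed, or '' when the row is missing."""
    tag = cells.get(key)
    return ' '.join(tag.get_text().split()) if tag is not None else ''


def link_href(cells, key):
    """href of the first link in the cell, or '' when row or link is missing."""
    tag = cells.get(key)
    a = tag.select_one('a[href]') if tag is not None else None
    return a['href'] if a is not None else ''


def main_activity(cells):
    """Text sitting directly inside <li>, ignoring the nested <div>."""
    tag = cells.get('Main activity')
    li = tag.select_one('li') if tag is not None else None
    if li is None:
        return ''
    return ''.join(li.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(html, 'html.parser')
cells = table_to_dict(soup)
phone = cell_text(cells, 'Phone')
email = link_href(cells, 'E-mail').replace('mailto:', '', 1)
website = link_href(cells, 'Website')
activity = main_activity(cells)
print(phone, email, website, activity, sep=' | ')
```

Looking rows up by label rather than by position means a page without an E-mail or Website row does not shift the other fields, and no None ever reaches .get('href').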

Link to a Notebook: as you can see, it organizes the table into a DataFrame, which can then be output to many formats, e.g. CSV.
IO tools (text, CSV, HDF5, …)
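
On the CSV question from post #11: opening extracted.csv with mode 'w' inside the loop recreates the file on every pass, which is why only one line survives. A minimal sketch, using only the standard library, of appending instead and skipping register codes that are already in the file. The function names load_seen and append_new_rows are made up for illustration, and it assumes the register code is the first column, as in the code from post #11.

```python
import csv
import os

FIELDS = ["Regcode", "Email", "Website", "Timestamp"]


def load_seen(path):
    """Return the set of Regcode values already stored in the output file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline='', encoding='utf8') as f:
        return {row.get("Regcode") for row in csv.DictReader(f, delimiter=';')}


def append_new_rows(path, rows):
    """Append rows whose Regcode is not in the file yet; return the number written."""
    seen = load_seen(path)
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    written = 0
    with open(path, 'a', newline='', encoding='utf8') as f:
        writer = csv.writer(f, delimiter=';')
        if write_header:
            writer.writerow(FIELDS)
        for row in rows:
            regcode = row[0]  # assumption: Regcode is the first column
            if regcode in seen:
                continue  # already extracted in an earlier run
            writer.writerow(row)
            seen.add(regcode)
            written += 1
    return written
```

In the weekly run, load_seen('extracted.csv') could also be called once before the loop over data.csv, so that known register codes are skipped before any request is sent.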