Webscraper for multiple urls

**Milan** · Sep-21-2020, 06:08 PM

Hello team,

I would like to share a script I have created.

It gets the name and price of a product for each url.

It looks like this:

# USED LIBRARIES
import urllib.request
from bs4 import BeautifulSoup

#URLS FROM WHICH NAME AND PRICE FROM EACH PRODUCT ARE RETRIEVED. ALL PAGES SHOULD HAVE THE SAME FORMAT
urls = ['https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671',
       'https://gigatron.rs/ssd/wd-ssd-blue-250gb-25-sata-iiiwds250g2b0a-250gb-25-sata-iii-do-550-mbs-125220',
       'https://gigatron.rs/ssd/silicon-power-ssd-512gb-25-sata-iii-ace-a55sp512gbss3a55s25-512gb-25-sata-iii-do-560-mbs-144553',
       'https://gigatron.rs/ssd/crucial-ssd-bx500-serijact120bx500ssd1-165010']

#LIST WHERE THE NAME AND PRICE ARE STORED
data = []

#THE MAGIC HAPPENS HERE
for i in urls:
    page = urllib.request.urlopen(i)
    soup = BeautifulSoup(page, features='lxml')
    name = soup.find('h1', {'itemprop':'name'}).text
    price = price = soup.find('span', {'itemprop':'price'}).text
    p = [name, price]
    data.append(p)

#DISPLAYS RESULTS
for j in data:
    print(j)

Any input on how to improve it, or simply any discussion about it, is welcome.

**Larz60+** · Sep-21-2020, 08:36 PM

Although urllib suffices in this instance, I'd suggest using requests rather than urllib in future code.
Requests provides a higher-level HTTP client interface.
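
For illustration, here is a minimal sketch of what fetching a page with requests could look like. The helper name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions made for this example, not something taken from the thread.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # Use the given requests.Session if there is one (it reuses connections),
    # otherwise fall back to the module-level requests.get.
    client = requests if session is None else session
    response = client.get(url, timeout=timeout)
    # Turn 4xx/5xx answers into a requests.exceptions.HTTPError
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` for each entry in `urls` would then replace `urllib.request.urlopen(i)`, and the returned text can be passed to `BeautifulSoup` as before.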

**scidam** · Sep-21-2020, 11:33 PM

1) I would recommend using meaningful variable names (e.g. url instead of i): for url in urls:
2) Typo (Line No 19): price = price = .
3) What if the content of the web page changes and the h1 and price elements no longer exist? What would the program do in that case?
4) What if the URL doesn't exist?
5) You can try to process several URLs in "parallel" (e.g. using threads) or asynchronously.

**Milan** · Sep-22-2020, 05:19 PM

So this is the version with the suggested amendments.

"""
@author: Milan Grujicic
"""

import requests
from bs4 import BeautifulSoup

urls = ['https://gigatron.rs/ssd/wd-ssd-green-series-wds480g2g0a-193671',
       'https://gigatron.rs/ssd/wd-ssd-blue-250gb-25-sata-iiiwds250g2b0a-250gb-25-sata-iii-do-550-mbs-125220',
       'https://gigatron.rs/ssd/silicon-power-ssd-512gb-25-sata-iii-ace-a55sp512gbss3a55s25-512gb-25-sata-iii-do-560-mbs-144553',
       'https://gigatron.rs/ssd/crucial-ssd-bx500-serijact120bx500ssd1-165010']

data = []    

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, features='lxml')
    
    try:
        name = soup.find('h1', {'itemprop':'name'}).text
    except AttributeError:
        print('h1 tag with name does not exist')
    
    try:
        price = soup.find('span', {'itemprop':'price'}).text
    except AttributeError:
        print('Span tag with price does not exist')
    
    p = [name, price]
    data.append(p)

for products in data:
    print(products)

Now it displays a message if the tags are not found, along with other minor changes.

The last two items have been puzzling me.

4) How can I detect URLs that do not exist?
5) Do you mean handling each URL in its own thread? How can I pass the URLs from the list to the threads to do so?
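
One possible way to approach both questions is sketched below, under some assumptions. It uses only the standard library (urllib, as in the first script, plus `concurrent.futures`). The names `fetch`, `fetch_all`, `fetcher`, and `max_workers` are made up for this example, and decoding the body as UTF-8 is also an assumption. With requests, a similar pattern should be possible by catching `requests.exceptions.RequestException` and calling `raise_for_status()` on the response.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (url, html, error); on success error is None, on failure html is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Assumption: the pages are UTF-8 encoded
            return url, response.read().decode("utf-8", errors="replace"), None
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return url, None, f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # No usable answer at all: unknown host, refused connection, ...
        return url, None, f"unreachable: {exc.reason}"


def fetch_all(urls, fetcher=fetch, max_workers=4):
    """Run `fetcher` on every URL in worker threads; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map hands each item of the list to a free thread
        return list(pool.map(fetcher, urls))
```

With this sketch, `fetch_all(urls)` returns one tuple per URL, so the main loop can report the error for a failed URL and only parse the HTML of the ones that succeeded. The URLs do not need to be assigned to threads by hand, since the pool takes them from the list.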

**buran** · (This post was last modified: Sep-22-2020, 05:23 PM by buran.)

Note that if you hit one of the except blocks you introduced, you will get an error if this is the first URL (name and/or price will not be defined), or they will hold an incorrect value (from the previous iteration of the loop).
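
A minimal sketch of one way to avoid that problem: check the result of `find` for `None` and skip the page instead of appending stale or undefined values. The helper name `extract_product` and the inline sample pages are made up for this example. It also uses `html.parser` instead of `lxml` and strips whitespace from the text, both of which are choices made here rather than part of the original script.

```python
from bs4 import BeautifulSoup


def extract_product(html):
    """Return [name, price], or None when either tag is missing."""
    soup = BeautifulSoup(html, features="html.parser")
    name_tag = soup.find("h1", {"itemprop": "name"})
    price_tag = soup.find("span", {"itemprop": "price"})
    # find() returns None when nothing matches, so test before using .text
    if name_tag is None or price_tag is None:
        return None
    return [name_tag.text.strip(), price_tag.text.strip()]


# Inline sample pages, invented for illustration (not taken from the site)
pages = [
    '<h1 itemprop="name">Sample SSD</h1><span itemprop="price">5.999</span>',
    '<h1 itemprop="name">No price here</h1>',
]

data = []
for html in pages:
    product = extract_product(html)
    if product is None:
        print("Skipping page: name or price tag not found")
        continue  # nothing from this page, or from a previous one, is appended
    data.append(product)

print(data)
```

Because `name` and `price` only exist inside the helper, a failed page can never reuse values from an earlier iteration.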
