Python Forum
Code Help, web scraping non-uniform lists (ul)
#1
Hi,

I am writing this code to scrape a website into an Excel spreadsheet. I'm having an issue where the website doesn't use lists of the same length, so I get an AttributeError from the chained find_next calls. Does anyone know of a workaround?
My code is a bit of a mess.

import requests
from bs4 import BeautifulSoup
import pandas as pd
page_number = 1

url = 'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=la94py&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&include-delivery-option=on&page='

agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
car_spec = []
car_age = []
car_style = []
car_mileage = []
car_engine_size = []
car_BHP = []
price_lst = []
car_detail = []
car_gearbox_style = []
car_fuel_type = []
car_next = []
while page_number < 100:
    all_car = []
    page_number += 1
    pg_no = str(page_number)
    print(page_number)
    url2 = url + pg_no
    response = requests.get(url2, headers=agent)
    soup = BeautifulSoup(response.content, 'lxml')
    car_elements = soup.find_all('div', class_='product-card-content__car-info')
    for tag in car_elements:
        price = tag.find('div', class_='product-card-pricing__price')
        price_lst.append(price.text.strip())
    for tag in car_elements:
        car = tag.find('h3', class_='product-card-details__title')
        car_detail.append(car.text.strip())
    for tag in car_elements:
        car = tag.find('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_age.append(car)
        else:
            car_age.append(car.text)
        car = tag.find('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_style.append(car)
        else:
            car_style.append(car.text)
        car = tag.find('li', class_ ='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_mileage.append(car)
        else:
            car_mileage.append(car.text)
        car = tag.find('li',class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_engine_size.append(car)
        else:
            car_engine_size.append(car.text)
        car= tag.find('li', class_ ='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_gearbox_style.append(car)
        else:
            car_gearbox_style.append(car.text)
        car = tag.find('li', class_ ='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_fuel_type.append(car)
        else:
            car_fuel_type.append(car.text)

    
    all_car = zip(car_detail, price_lst, car_age, car_style, car_mileage, car_engine_size, car_gearbox_style, car_fuel_type)

    
# Create the pandas DataFrame
df = pd.DataFrame(all_car)
df.to_excel("car_info.xlsx", index=False, sheet_name='car_info')
#2
See my last post for reference.
Now you are throwing in a lot of stuff and trying to run it on 100 pages; take small steps, as I tried to point out in that post.
If I were to make a loop, I would do it like this, with no while loop (rarely needed in Python).
import requests
from bs4 import BeautifulSoup
import pandas as pd

price_lst = []
car_detail = []
for page in range(1, 3):
    url = f'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=la94py&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&include-delivery-option=on&page={page}'
    agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
    response = requests.get(url, headers=agent)
    soup = BeautifulSoup(response.content, 'lxml')
    car_elements = soup.find_all('div', class_='product-card-content__car-info')
    for tag in car_elements:
        price = tag.find('div', class_='product-card-pricing__price')
        price_lst.append(price.text.strip())
    for tag in car_elements:
        car = tag.find('h3', class_='product-card-details__title')
        car_detail.append(car.text.strip())

all_car = zip(car_detail, price_lst)
# Create the pandas DataFrame
df = pd.DataFrame(all_car, columns=['Name', 'price'])
df.to_excel("car_info.xlsx", index=False, sheet_name='car_info')
Output:
>>> df
                Name    price
0        Ford Fiesta  £10,789
1  Vauxhall Insignia     £295
2           Saab 9-3   £1,995
3        Ford Fiesta   £2,399
4       Renault Clio   £1,999
...
First get the loop and the writing to Excel working with these two columns on three pages.
Then you can try to add more car info; test at every step so you know when it goes wrong 🚩
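When you do add a third column, a minimal sketch of one way to do it (assuming the atc-type-picanto--medium class from your first post still marks the spec items) is to test for None before appending:
import requests
from bs4 import BeautifulSoup
import pandas as pd

price_lst, car_detail, car_age = [], [], []
agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
for page in range(1, 3):
    url = f'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=la94py&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&include-delivery-option=on&page={page}'
    response = requests.get(url, headers=agent)
    soup = BeautifulSoup(response.content, 'lxml')
    for tag in soup.find_all('div', class_='product-card-content__car-info'):
        price_lst.append(tag.find('div', class_='product-card-pricing__price').text.strip())
        car_detail.append(tag.find('h3', class_='product-card-details__title').text.strip())
        # The first spec <li> can be missing on some cards, so test for None before using .text
        age = tag.find('li', class_='atc-type-picanto--medium')
        car_age.append(age.text.strip() if age is not None else '0')

df = pd.DataFrame(zip(car_detail, price_lst, car_age), columns=['Name', 'price', 'age'])
df.to_excel("car_info.xlsx", index=False, sheet_name='car_info')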
#3
I know where it goes wrong: when the list on the website is shorter than, say, six items, the program looks for the sixth item when there isn't one and therefore hits a NoneType error. At least that is what I think is going wrong, or am I misunderstanding? I like the use of the range function; it simplifies the code a little. The error only shows up with large quantities of data, since the more data you have, the more likely it is that some is missing. And for testing I did use the method you employed of adding one variable at a time.
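Just to check my understanding, here is a minimal illustration of what I think happens (made-up HTML, with fewer of the li items than my code expects):
from bs4 import BeautifulSoup

# A result card with only two spec items instead of the usual six or seven
html = '''
<div class="product-card-content__car-info">
  <ul>
    <li class="atc-type-picanto--medium">2014 (14 reg)</li>
    <li class="atc-type-picanto--medium">Hatchback</li>
  </ul>
</div>'''

tag = BeautifulSoup(html, 'lxml').find('div', class_='product-card-content__car-info')
first = tag.find('li', class_='atc-type-picanto--medium')
third = first.find_next('li', class_='atc-type-picanto--medium') \
             .find_next('li', class_='atc-type-picanto--medium')
print(third)  # None -- there is no third <li>

# Chaining one more .find_next onto None is what blows up
try:
    third.find_next('li', class_='atc-type-picanto--medium')
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'find_next'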
#4
It's common to get some errors when trying to scrape a lot of pages.
There are different ways to handle it; I can construct an AttributeError as a demo.
from bs4 import BeautifulSoup

html = '''\
<tr>
  <td id="BMW">Black color</td>
  <td>2014 model</td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
for td in soup.find('td', id="BMW").find_next('td'):
    print(td)
Output:
2014 model
So if I change the id to Lada🚗 I get an AttributeError (some would say it's a big error😲), which can then be caught with try/except.
from bs4 import BeautifulSoup

html = '''\
<tr>
  <td id="BMW">Black color</td>
  <td>2014 model</td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
try:
    for td in soup.find('td', id="Lada").find_next('td'):
        print(td)
except AttributeError:
    print('Got an error AttributeError')
    td = 'Dummy value'

print(td)
Output:
Got an error AttributeError
Dummy value
To do nothing and just skip the error, it would be:
except AttributeError:
    pass
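Another option that avoids the long find_next chains altogether is to grab all the spec li tags of one card with find_all() and pad the list to a fixed length. A minimal sketch (class names taken from the first post, adjust if the site has changed):
from bs4 import BeautifulSoup

def card_specs(tag, expected=7, placeholder='0'):
    # Return the spec <li> texts of one result card, padded to a fixed length
    values = [li.text.strip()
              for li in tag.find_all('li', class_='atc-type-picanto--medium')]
    return values + [placeholder] * (expected - len(values))

# Quick check with a card that has only two spec items
html = '''
<div class="product-card-content__car-info">
  <ul>
    <li class="atc-type-picanto--medium">2014 (14 reg)</li>
    <li class="atc-type-picanto--medium">Hatchback</li>
  </ul>
</div>'''
tag = BeautifulSoup(html, 'lxml').find('div', class_='product-card-content__car-info')

age, style, mileage, engine, bhp, gearbox, fuel = card_specs(tag)
print(age, style, mileage)
Output:
2014 (14 reg) Hatchback 0
Then each append can just index into the returned list, and short cards get the placeholder instead of raising an error.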
#5
Ahh yes, this seems like exactly what I was looking for 😇

