Python Forum
Code Help, web scraping non-uniform lists (ul)
#1
Hi,

I am writing this code to scrape a website into an Excel spreadsheet. I'm having an issue where the website doesn't use lists of the same length, so I get an AttributeError from the chained find_next calls. Does anyone know of a workaround?
My code is a bit of a mess.

import requests
from bs4 import BeautifulSoup
import pandas as pd
page_number = 1

url = 'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=la94py&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&include-delivery-option=on&page='

agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
car_spec = []
car_age = []
car_style = []
car_mileage = []
car_engine_size = []
car_BHP = []
price_lst = []
car_detail = []
car_gearbox_style = []
car_fuel_type = []
car_next = []
while page_number < 100:
    all_car = []
    page_number += 1
    pg_no = str(page_number)
    print(page_number)
    url2 = url + pg_no
    response = requests.get(url2, headers=agent)
    soup = BeautifulSoup(response.content, 'lxml')
    car_elements = soup.find_all('div', class_='product-card-content__car-info')
    for tag in car_elements:
        price = tag.find('div', class_='product-card-pricing__price')
        price_lst.append(price.text.strip())
    for tag in car_elements:
        car = tag.find('h3', class_='product-card-details__title')
        car_detail.append(car.text.strip())
    for tag in car_elements:
        car = tag.find('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_age.append(car)
        else:
            car_age.append(car.text)
        car = tag.find('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_style.append(car)
        else:
            car_style.append(car.text)
        car = tag.find('li', class_ ='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_mileage.append(car)
        else:
            car_mileage.append(car.text)
        car = tag.find('li',class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_engine_size.append(car)
        else:
            car_engine_size.append(car.text)
        car= tag.find('li', class_ ='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_gearbox_style.append(car)
        else:
            car_gearbox_style.append(car.text)
        car = tag.find('li', class_ ='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium').find_next('li', class_='atc-type-picanto--medium')
        if car is None:
            car='0'
            car_fuel_type.append(car)
        else:
            car_fuel_type.append(car.text)

    
    all_car = zip(car_detail, price_lst, car_age, car_style, car_mileage, car_engine_size, car_gearbox_style, car_fuel_type)

    
# Create the pandas DataFrame
df = pd.DataFrame(all_car)
df.to_excel("car_info.xlsx", index=False, sheet_name='car_info')
#2
See my last post for reference.
Now you are throwing in a lot of stuff and trying to run it on 100 pages; take small steps, as I tried to point out in that post.
If I were to make a loop, I would do it like this, with no while loop (rarely needed in Python).
import requests
from bs4 import BeautifulSoup
import pandas as pd

price_lst = []
car_detail = []
for page in range(1, 3):
    url = f'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=la94py&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&include-delivery-option=on&page={page}'
    agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
    response = requests.get(url, headers=agent)
    soup = BeautifulSoup(response.content, 'lxml')
    car_elements = soup.find_all('div', class_='product-card-content__car-info')
    for tag in car_elements:
        price = tag.find('div', class_='product-card-pricing__price')
        price_lst.append(price.text.strip())
    for tag in car_elements:
        car = tag.find('h3', class_='product-card-details__title')
        car_detail.append(car.text.strip())

all_car = zip(car_detail, price_lst)
# Create the pandas DataFrame
df = pd.DataFrame(all_car, columns=['Name', 'price'])
df.to_excel("car_info.xlsx", index=False, sheet_name='car_info')
Output:
>>> df
                Name    price
0        Ford Fiesta  £10,789
1  Vauxhall Insignia     £295
2           Saab 9-3   £1,995
3        Ford Fiesta   £2,399
4       Renault Clio   £1,999
...
First get the loop and the writing to Excel working with these two columns on three pages.
Then you can try to add more car info; test at every step so you know when it goes wrong 🚩
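When you do add a third column, a minimal sketch of one way to do it (assuming the atc-type-picanto--medium class from your first post still marks the spec items) is to test for None before appending:
import requests
from bs4 import BeautifulSoup
import pandas as pd

price_lst, car_detail, car_age = [], [], []
agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
for page in range(1, 3):
    url = f'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=la94py&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&include-delivery-option=on&page={page}'
    response = requests.get(url, headers=agent)
    soup = BeautifulSoup(response.content, 'lxml')
    for tag in soup.find_all('div', class_='product-card-content__car-info'):
        price_lst.append(tag.find('div', class_='product-card-pricing__price').text.strip())
        car_detail.append(tag.find('h3', class_='product-card-details__title').text.strip())
        # The first spec <li> can be missing on some cards, so test for None before using .text
        age = tag.find('li', class_='atc-type-picanto--medium')
        car_age.append(age.text.strip() if age is not None else '0')

df = pd.DataFrame(zip(car_detail, price_lst, car_age), columns=['Name', 'price', 'age'])
df.to_excel("car_info.xlsx", index=False, sheet_name='car_info')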
#3
I know where it goes wrong: when the list on the website is shorter than, say, six items, the program looks for the sixth item when there isn't one and therefore hits a NoneType error. At least that is what I think is going wrong, or am I misunderstanding? I like the use of the range function; it simplifies the code a little. The error only shows up with large quantities of data, since the more data you have, the more likely it is that some is missing. And for testing I did use the method you employed of adding one variable at a time.
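Just to check my understanding, here is a minimal illustration of what I think happens (made-up HTML, with fewer of the li items than my code expects):
from bs4 import BeautifulSoup

# A result card with only two spec items instead of the usual six or seven
html = '''
<div class="product-card-content__car-info">
  <ul>
    <li class="atc-type-picanto--medium">2014 (14 reg)</li>
    <li class="atc-type-picanto--medium">Hatchback</li>
  </ul>
</div>'''

tag = BeautifulSoup(html, 'lxml').find('div', class_='product-card-content__car-info')
first = tag.find('li', class_='atc-type-picanto--medium')
third = first.find_next('li', class_='atc-type-picanto--medium') \
             .find_next('li', class_='atc-type-picanto--medium')
print(third)  # None -- there is no third <li>

# Chaining one more .find_next onto None is what blows up
try:
    third.find_next('li', class_='atc-type-picanto--medium')
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'find_next'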
#4
It's common to get some errors when trying to scrape a lot of pages.
There are different ways to handle it; I can construct an AttributeError as a demo.
from bs4 import BeautifulSoup

html = '''\
<tr>
  <td id="BMW">Black color</td>
  <td>2014 model</td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
for td in soup.find('td', id="BMW").find_next('td'):
    print(td)
Output:
2014 model
So if I change the id to Lada🚗 I get an AttributeError (some would say it's a big error😲), which can then be caught with try/except.
from bs4 import BeautifulSoup

html = '''\
<tr>
  <td id="BMW">Black color</td>
  <td>2014 model</td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
try:
    for td in soup.find('td', id="Lada").find_next('td'):
        print(td)
except AttributeError:
    print('Got an error AttributeError')
    td = 'Dummy value'

print(td)
Output:
Got an error AttributeError
Dummy value
To do nothing and just skip the error, it would be:
except AttributeError:
    pass
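Another option that avoids the long find_next chains altogether is to grab all the spec li tags of one card with find_all() and pad the list to a fixed length. A minimal sketch (class names taken from the first post, adjust if the site has changed):
from bs4 import BeautifulSoup

def card_specs(tag, expected=7, placeholder='0'):
    # Return the spec <li> texts of one result card, padded to a fixed length
    values = [li.text.strip()
              for li in tag.find_all('li', class_='atc-type-picanto--medium')]
    return values + [placeholder] * (expected - len(values))

# Quick check with a card that has only two spec items
html = '''
<div class="product-card-content__car-info">
  <ul>
    <li class="atc-type-picanto--medium">2014 (14 reg)</li>
    <li class="atc-type-picanto--medium">Hatchback</li>
  </ul>
</div>'''
tag = BeautifulSoup(html, 'lxml').find('div', class_='product-card-content__car-info')

age, style, mileage, engine, bhp, gearbox, fuel = card_specs(tag)
print(age, style, mileage)
Output:
2014 (14 reg) Hatchback 0
Then each append can just index into the returned list, and short cards get the placeholder instead of raising an error.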
#5
Ahh yes, this seems like exactly what I was looking for 😇

