Python Forum

Full Version: Cannot extract data from the next pages
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Dear Members,

I am writing Python codes to extract EyeGlass listings at 'https://www.glassesshop.com/bestsellers'. The codes perfectly extract data from the first page and but fails to extract data from the next pages. There are in total 5 pages. I list both VS codes and Terminal report here. I highly appreciate your help.

# -*- coding: utf-8 -*-
import scrapy


class GlassSpider(scrapy.Spider):
    name = 'glass'
    allowed_domains = ['www.glassesshop.com']
    start_urls = ['https://www.glassesshop.com/bestsellers']

    def parse(self, response):
        names=response.xpath("//p[@class='pname col-sm-12']/a")
        for name in names:
            name_var=name.xpath(".//text()").get()
            link=name.xpath(".//@href").get()

            yield response.follow(url=link, callback=self.parse_glass, meta={'glass_name': name_var})

    def parse_glass(self, response):
        name_var=response.request.meta['glass_name']
        price=response.xpath("//span[@class='product-price-original']/text()").get()
        sku=response.xpath("//ul[@class='col-12 col-sm-6 default-content']/li[1]/text()").get()
        frame=response.xpath("//a[@class='col01']/text()").get()

        yield{
            'glass_name': name_var,
            'price': price,
            'sku': sku,
            'frame': frame
            }
        
        next_page = response.xpath("(//div[@class='custom-pagination']/ul/li)[7]/a/@href").get()
        
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)
Terminal Report:
change start_urls to include all 5 pages

start_urls = [f'https://www.glassesshop.com/bestsellers?page={page}' for page in range(1, 6)]
Thank you, buran, for your response. It works perfectly fine now. If you do not mind, could you please briefly explain the problem in the code. I believe I will learn from your explanation and in the future solve this sort of problem.
well, I don't know what's there to explain. You have 2 levels of pages - the top 5 pages is the first level. When you parse these 5 pages you have all the urls of each individual product. The second levels is the each individual product page.
Your start_urls had only one of the 5 top level urls.

as explained in the docs, start_urls list is shortcut for start_requests method

def start_requests(self):
    for page in range(1, 6):
        yield scrapy.Request(url=f'https://www.glassesshop.com/bestsellers?page={page}', callback=self.parse)
The explanation completely makes sense. Thank you, buran.