Python Forum
Cannot extract data from the next pages - Printable Version




Cannot extract data from the next pages - nazmulfinance - Nov-11-2019

Dear Members,

I am writing a Python script to extract the eyeglass listings at https://www.glassesshop.com/bestsellers. The code extracts data from the first page perfectly but fails to extract data from the following pages. There are 5 pages in total. I have included both the code (from VS Code) and the Terminal report below. I highly appreciate your help.

# -*- coding: utf-8 -*-
import scrapy


class GlassSpider(scrapy.Spider):
    name = 'glass'
    allowed_domains = ['www.glassesshop.com']
    start_urls = ['https://www.glassesshop.com/bestsellers']

    def parse(self, response):
        names=response.xpath("//p[@class='pname col-sm-12']/a")
        for name in names:
            name_var=name.xpath(".//text()").get()
            link=name.xpath(".//@href").get()

            yield response.follow(url=link, callback=self.parse_glass, meta={'glass_name': name_var})

    def parse_glass(self, response):
        name_var=response.request.meta['glass_name']
        price=response.xpath("//span[@class='product-price-original']/text()").get()
        sku=response.xpath("//ul[@class='col-12 col-sm-6 default-content']/li[1]/text()").get()
        frame=response.xpath("//a[@class='col01']/text()").get()

        yield{
            'glass_name': name_var,
            'price': price,
            'sku': sku,
            'frame': frame
            }
        
        next_page = response.xpath("(//div[@class='custom-pagination']/ul/li)[7]/a/@href").get()
        
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)
Terminal Report:



RE: Cannot extract data from the next pages - buran - Nov-11-2019

Change start_urls to include all 5 pages:

start_urls = [f'https://www.glassesshop.com/bestsellers?page={page}' for page in range(1, 6)]



RE: Cannot extract data from the next pages - nazmulfinance - Nov-11-2019

Thank you, buran, for your response. It works perfectly now. If you do not mind, could you please briefly explain the problem in the code? I believe I will learn from your explanation and be able to solve this sort of problem myself in the future.


RE: Cannot extract data from the next pages - buran - Nov-11-2019

Well, I don't know what there is to explain. You have two levels of pages: the first level is the five top-level listing pages. When you parse those five pages you get the URLs of every individual product. The second level is each individual product page.
Your start_urls had only one of the five top-level URLs.

As explained in the docs, the start_urls list is a shortcut for the start_requests method:

def start_requests(self):
    for page in range(1, 6):
        yield scrapy.Request(url=f'https://www.glassesshop.com/bestsellers?page={page}', callback=self.parse)
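
An alternative would be to follow the "next page" link from parse itself rather than from parse_glass, since the paginator presumably sits on the listing pages rather than on the individual product pages. Below is a minimal sketch along those lines, reusing the selectors from the original post (whether that pagination XPath still matches a "next" link on the live site is an assumption):

# -*- coding: utf-8 -*-
import scrapy


class GlassSpider(scrapy.Spider):
    name = 'glass'
    allowed_domains = ['www.glassesshop.com']
    start_urls = ['https://www.glassesshop.com/bestsellers']

    def parse(self, response):
        # Listing page: queue every product detail page.
        for name in response.xpath("//p[@class='pname col-sm-12']/a"):
            yield response.follow(
                url=name.xpath(".//@href").get(),
                callback=self.parse_glass,
                meta={'glass_name': name.xpath(".//text()").get()},
            )

        # The paginator is assumed to appear only on the listing pages,
        # so the "next page" link is followed here, not in parse_glass.
        # XPath reused from the original post.
        next_page = response.xpath(
            "(//div[@class='custom-pagination']/ul/li)[7]/a/@href").get()
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_glass(self, response):
        # Product page: emit one item per pair of glasses.
        yield {
            'glass_name': response.request.meta['glass_name'],
            'price': response.xpath("//span[@class='product-price-original']/text()").get(),
            'sku': response.xpath("//ul[@class='col-12 col-sm-6 default-content']/li[1]/text()").get(),
            'frame': response.xpath("//a[@class='col01']/text()").get(),
        }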



RE: Cannot extract data from the next pages - nazmulfinance - Nov-11-2019

The explanation completely makes sense. Thank you, buran.