
 Cannot extract data from the next pages
#1
Dear Members,

I am writing Python code to extract eyeglass listings from 'https://www.glassesshop.com/bestsellers'. The code extracts data from the first page perfectly but fails to extract data from the next pages. There are 5 pages in total. I include both the VS Code script and the Terminal report here. I would highly appreciate your help.

# -*- coding: utf-8 -*-
import scrapy


class GlassSpider(scrapy.Spider):
    name = 'glass'
    allowed_domains = ['www.glassesshop.com']
    start_urls = ['https://www.glassesshop.com/bestsellers']

    def parse(self, response):
        names=response.xpath("//p[@class='pname col-sm-12']/a")
        for name in names:
            name_var=name.xpath(".//text()").get()
            link=name.xpath(".//@href").get()

            yield response.follow(url=link, callback=self.parse_glass, meta={'glass_name': name_var})

    def parse_glass(self, response):
        name_var=response.request.meta['glass_name']
        price=response.xpath("//span[@class='product-price-original']/text()").get()
        sku=response.xpath("//ul[@class='col-12 col-sm-6 default-content']/li[1]/text()").get()
        frame=response.xpath("//a[@class='col01']/text()").get()

        yield {
            'glass_name': name_var,
            'price': price,
            'sku': sku,
            'frame': frame,
        }

        next_page = response.xpath("(//div[@class='custom-pagination']/ul/li)[7]/a/@href").get()

        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)
Terminal Report:
#2
Change start_urls to include all 5 pages:

start_urls = [f'https://www.glassesshop.com/bestsellers?page={page}' for page in range(1, 6)]
#3
Thank you, buran, for your response. It works perfectly now. If you don't mind, could you please briefly explain the problem in the code? I believe I will learn from your explanation and be able to solve this sort of problem myself in the future.
#4
Well, I don't know what there is to explain. You have two levels of pages: the five listing pages are the first level. When you parse those five pages, you have the URLs of all the individual products. The second level is each individual product page.
Your start_urls had only one of the five top-level URLs.
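That two-level structure presumably also explains why your next_page request never fired: the custom-pagination block would only exist on the listing pages, so in parse_glass (a product page) the XPath returns None and no request is scheduled. If you'd rather follow the "next" link instead of hardcoding range(1, 6), the lookup belongs in parse. A minimal sketch of just the URL-chaining part, using the stdlib urljoin (the function name is mine, and I haven't checked it against the live site):

```python
from urllib.parse import urljoin

def next_listing_url(current_url, next_href):
    """Resolve a (possibly relative) pagination href against the
    current listing URL; return None when there is no next link."""
    if not next_href:
        return None
    return urljoin(current_url, next_href)

# On a listing page the pagination XPath yields something like '?page=2':
print(next_listing_url('https://www.glassesshop.com/bestsellers', '?page=2'))
# On a product page the XPath yields None, so nothing is scheduled:
print(next_listing_url('https://www.glassesshop.com/some-product', None))
```

In the spider you would call this inside parse and yield response.follow(next_page, callback=self.parse) only when it is not None.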

As explained in the docs, the start_urls list is a shortcut for the start_requests method:

def start_requests(self):
    for page in range(1, 6):
        yield scrapy.Request(url=f'https://www.glassesshop.com/bestsellers?page={page}', callback=self.parse)
#5
The explanation completely makes sense. Thank you, buran.
