Python Forum

Full Version: Is it possible to add a delay right after a request.get()
I'm trying to build a scraper to get pricing and description from this site, just for the men's shoes.
When you visit the site normally in a browser, the page loads, but then some sort of "processing" activity runs that makes the page inaccessible for 2 or 3 seconds before you can scroll or click on anything. This happens on every page of results as you navigate, but it doesn't seem to happen on the individual detail pages.

Anyway, I'm thinking that some sort of delay will need to be added to the code below at some point, to let the page load and that process finish before trying to access and scrape it. I ran the exact same code on another shoe site and it processed roughly 230 results in about 30 seconds.

import requests
from bs4 import BeautifulSoup

# https://www.dickssportinggoods.com/f/all-mens-footwear?pageNumber=0 is Page 1

productdetails = []

for x in range(0, 87):
    response = requests.get(f'https://www.dickssportinggoods.com/f/all-mens-footwear?pageNumber={x}')
    soup = BeautifulSoup(response.content, 'lxml')

    element_list = soup.find_all('div', class_='product-content')
    for element in element_list:
        for link in element.find_all('a', class_='product-card-simple-title'):
            print("Description: " + link.get_text().strip())
            productdetails.append("Description: " + link.get_text().strip())
            for price in element.find_all('span', class_='sr-only'):
                # normalize "12 dollars 99 cents" style text into "12.99"
                price_text = (price.get_text().strip()
                              .replace('\n', '').replace(' ', '')
                              .replace('dollars', '.').replace('cents', ''))
                print("Price: " + price_text)
                productdetails.append("Price: " + price_text)
Any update?
For precise control, I would recommend Selenium. See this.
Building a scraper can be challenging, especially when dealing with dynamic websites that have loading delays or load content with AJAX. Given the "processing" activity you described, the site is probably using some form of lazy loading or client-side rendering, which means the HTML returned by requests.get() may not contain the fully rendered product data.

To handle this in your scraper, you might need to use a tool like Selenium or Puppeteer, which allows for browser automation. These tools can mimic real user interactions, like waiting for a page to load fully before scraping the content.

Here's a basic approach using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the browser
driver = webdriver.Chrome()

# Navigate to the website
driver.get('URL_OF_THE_WEBSITE')

# Wait for the "processing" activity to complete
wait = WebDriverWait(driver, 10) # wait for up to 10 seconds
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'CSS_SELECTOR_OF_AN_ELEMENT_YOU_WANT_TO_WAIT_FOR')))

# Now, you can scrape the content
content = driver.page_source

# Don't forget to close the browser once done
driver.quit()

Remember to replace 'URL_OF_THE_WEBSITE' with the actual URL and 'CSS_SELECTOR_OF_AN_ELEMENT_YOU_WANT_TO_WAIT_FOR' with a CSS selector of an element you know will be present after the "processing" activity.

Also, when building or using scrapers, always ensure you're respecting the website's robots.txt file and terms of service. Some sites might have restrictions against scraping, and you wouldn't want to inadvertently violate any terms.
So soon after posting my initial question above, I went and took a few days off, so no new updates yet. BUT I will be working on this again this week. Thank you for the suggestions; I will definitely work with the above sample to see if I can understand and work with Selenium.

You mentioned the robots.txt file. Is that something that has to be written into the Python code logic, or is it more a matter of reading that file to check for any restrictions?
robots.txt will be found at the root of a website.
In your instance it can be found here.
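Mostly it's a matter of reading the file and respecting it, but you can also have Python check it for you with the standard library's urllib.robotparser. A minimal sketch — the rules below are made up for illustration; in practice you would point the parser at the site's real file with rp.set_url('https://www.dickssportinggoods.com/robots.txt') and rp.read():

```python
from urllib import robotparser

# Hypothetical robots.txt rules, for illustration only
rules = """
User-agent: *
Disallow: /checkout/
Allow: /f/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://www.dickssportinggoods.com/f/all-mens-footwear"))  # True
print(rp.can_fetch("*", "https://www.dickssportinggoods.com/checkout/cart"))        # False
```

Calling can_fetch() before each request lets the scraper skip disallowed paths automatically instead of you auditing the file by hand.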
Yes, it is possible to add a delay right after making a request using the requests.get() function in Python. Adding a delay can be useful in situations where you want to control the rate at which you make requests to a web server, especially if you're scraping data or interacting with a web API that has rate-limiting policies.

You can introduce a delay using the time.sleep() function from the time module. Here's an example of how to add a delay of, let's say, 2 seconds after making a GET request:


import requests
import time

url = "https://example.com"
response = requests.get(url)

# Add a 2-second delay
time.sleep(2)

# Continue with your code after the delay
In this example, the time.sleep(2) line will pause the execution of your script for 2 seconds before moving on to the next line of code. You can adjust the delay duration by changing the argument to time.sleep() to suit your needs.

Keep in mind that while adding a delay can help you avoid overloading a server or violating rate limits, it will also make your script run slower. Balancing the delay duration is important to achieve the desired rate of requests while keeping your script reasonably efficient.
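If you end up fetching many pages in a loop, one way to keep that balance in a single place is a small wrapper around the request. This is just a sketch — polite_get and its delay parameter are my own invention, not a requests feature:

```python
import time

import requests


def polite_get(url, delay=2.0, session=requests):
    """GET the url, then pause so back-to-back calls are spaced out."""
    response = session.get(url)
    time.sleep(delay)  # throttle before the caller can issue the next request
    return response


# Usage: fetch several result pages with a 2-second gap between them
# for n in range(3):
#     page = polite_get(f'https://example.com/f/all-mens-footwear?pageNumber={n}')
```

Raising or lowering the delay argument then tunes the whole scraper's request rate without touching the loop body.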