Python Forum
Unable to fetch product url using BeautifulSoup with Python3.6
#1
Hi Expert,

I have fetched data from the HTML using the code below:
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    html = response.content
    return BeautifulSoup(html, "html.parser")
And I have fetched the category URLs with:
def get_category_urls(url):
    soup = get_soup(url)
    cat_urls = []
    try:
        categories = soup.find('div', attrs={'id': 'menu_oc'})
        if categories is not None:
            for c in categories.findAll('a'):
                if c['href'] is not None:
                    cat_urls.append(c['href'])
    except Exception as exc:
        print("error::" + url + str(exc))
    finally:
        return cat_urls
Now I am trying to fetch the product URLs with the code below:
def get_product_urls(url):
    soup = get_soup(url)
    prod_urls = []
    try:
        if soup.find('div', attrs={'class': 'pagination'}):
            pages = soup.find('div', attrs={'class': 'page'}).text.split("of ", 1)[1].replace(' (1 Pages)','')
            if pages is not None:
                for page in range(1, int(pages) + 1):
                    soup_with_page = get_soup(url + "&page={}".format(page))
                    product_urls_soup = soup_with_page.find('div', attrs={'id': 'carousel-featured-0'})
                    if product_urls_soup is not None:
                        for row in product_urls_soup.findAll('a'):
                            if row['href'] is not None:
                                prod_urls.append(row['href'])
    except Exception as exc:
        print("error:: " + prod_urls + ": " + str(exc))
    finally:
        return prod_urls
from multiprocessing import Pool

if __name__ == '__main__':
    # category_urls is the list returned by get_category_urls(...)
    with Pool(2) as p:
        product_urls = p.map(get_product_urls, category_urls)
    # Drop empty results, then flatten the per-category lists into one de-duplicated list
    product_urls = list(filter(None, product_urls))
    product_urls_flat = list(set([y for x in product_urls for y in x]))
I am getting product_urls_soup as None here; what am I doing wrong? Please find the sample HTML data below:

html data

How do I handle pagination here, since some categories have pagination and some do not?

Finally I found the issue.
I was not checking for pagination for all categories, and that's why I was getting the problem.
Now I am able to solve the issue by adding a check for pagination.
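For anyone who hits the same thing later, here is a minimal sketch of that check. The 'pagination' div, the 'page' div with its "(N Pages)" text, and the 'carousel-featured-0' container are assumptions taken from the snippets above, so they may not match every category page:

def get_product_urls(url):
    soup = get_soup(url)
    prod_urls = []
    if soup is None:
        return prod_urls
    # Assume a single page unless a pagination block is present
    pages = 1
    if soup.find('div', attrs={'class': 'pagination'}) is not None:
        page_info = soup.find('div', attrs={'class': 'page'})
        if page_info is not None:
            # Assumed text format: "Showing 1 to 15 of 42 (3 Pages)"
            pages = int(page_info.text.split('(')[-1].split(' ')[0])
    for page in range(1, pages + 1):
        page_soup = get_soup(url + "&page={}".format(page))
        if page_soup is None:
            continue
        container = page_soup.find('div', attrs={'id': 'carousel-featured-0'})
        if container is not None:
            for a in container.findAll('a'):
                if a.get('href'):
                    prod_urls.append(a['href'])
    return prod_urls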
Reply
#2
what is the original url?
Reply
#3
Hi All,

I am trying to scrape data from a site and am able to fetch the category URLs with the lines of code below:

def get_soup(url):
    soup = None
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            soup = BeautifulSoup(html, "html.parser")
    except Exception as exc:
        print("error::", str(exc))
    finally:
        return soup

def get_category_urls(url):
    soup = get_soup(url)
    cat_urls = []
    try:
        categories = soup.find('div', attrs={'id': 'menu_oc'})
        if categories is not None:
            for c in categories.findAll('a'):
                if c['href'] is not None:
                    cat_urls.append(c['href'])
    except Exception as exc:
        print("error..", str(exc))
    finally:
        print("category urls::", cat_urls)
        return cat_urls
Now the issue is with fetching the product URLs, because I have to fetch all product URLs from each category (both with and without pagination), and I am not able to proceed.

Can anyone please help me write a function to get the product URLs?
Reply
#4
Please post enough code for us to run it without extra work. Otherwise it forces us to improvise, perhaps causing different results:

My attempt:
from bs4 import BeautifulSoup
import requests


def get_soup(url):
    soup = None
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        if response.status_code == 200:
            html = response.content
            soup = BeautifulSoup(html, "html.parser")
    except Exception as exc:
        print("error::", str(exc))
    finally:
        return soup
 
def get_category_urls(url):
    soup = get_soup(url)
    cat_urls = []
    try:
        categories = soup.find('div', attrs={'id': 'menu_oc'})
        if categories is not None:
            for c in categories.findAll('a'):
                if c['href'] is not None:
                    cat_urls.append(c['href'])
    except Exception as exc:
        print("error..", str(exc))
    finally:
        print("category urls::", cat_urls)
        return cat_urls

def main():
    url = 'http://www.infantree.net/shop/'
    soup = get_soup(url)

if __name__ == '__main__':
    main()
Error:
error:: name 'headers' is not defined
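That NameError is simply because headers and timeout are used in the requests.get call but never defined in the code above. A minimal sketch of the missing globals, where the User-Agent string and the 10-second timeout are placeholder values rather than anything from the original post:

# Placeholder request settings; adjust to whatever the real scraper needs
headers = {'User-Agent': 'Mozilla/5.0'}
timeout = 10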
Reply
#5
I have edited the code, please recheck!
Reply
#6
Turn off JavaScript in the browser and see how many product URLs you see.
The same as I just posted in this thread.
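A quick way to check the same thing from Python is to count the links that requests actually gets back, since no JavaScript runs there. A minimal sketch, assuming the shop URL from the earlier post:

from bs4 import BeautifulSoup
import requests

# Fetch the page exactly as requests sees it; no JavaScript is executed
url = 'http://www.infantree.net/shop/'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')

# Count anchor tags present in the static markup only
links = [a['href'] for a in soup.find_all('a', href=True)]
print(len(links), "links in the static HTML")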
Reply
#7
The issue is resolved now.
Reply

