Posts: 8
Threads: 2
Joined: Oct 2019
Oct-17-2019, 05:38 PM
(This post was last modified: Oct-18-2019, 01:26 AM by Larz60+.)
Hello
A beginner in web scraping and Python here.
I've been trying to scrape a website using Python and Beautiful Soup.
I ran the code below from cmd, but it's not fetching the results, i.e. the h2 tags.
Please inspect the code and let me know if I made any mistake. Thanks.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.daraz.pk/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find('a')
    urls.append(a_tag.attrs['href'])
print(urls)
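One thing to watch for in that loop: find('a') returns None when an h2 has no a child, and a_tag.attrs would then raise AttributeError. A defensive sketch of the same loop, run against inline HTML so it works offline:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched page, so this runs without network access.
html = '<h2><a href="/item1">One</a></h2><h2>No link inside</h2>'
soup = BeautifulSoup(html, 'html.parser')

urls = []
for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    if a_tag is not None:  # skip h2 tags with no <a> child
        urls.append(a_tag['href'])
print(urls)  # ['/item1']
```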
Posts: 12,031
Threads: 485
Joined: Sep 2016
The only h2 tags on this page are:
Output: <div class="drz-footer-about">
<h1 class="drz-footer-title">
<h2 style="font-size: 14px; font-weight: 400; line-height: 20px; margin: 0 0 10px; color: #606060;">
What Makes Us Different from Other Online Shopping Platforms?
</h2>
</h1>
</div>
This code will help you see where your links are located.
The following module reformats the HTML so it is easier to view.
Save it as PrettifyPage.py (use this exact name, as it will be imported):
# PrettifyPage.py
from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line

if __name__ == '__main__':
    pp = PrettifyPage()
This will fetch the page:
import requests
from bs4 import BeautifulSoup

def fetch_url(url, debug=False):
    if debug:
        import PrettifyPage
        pp = PrettifyPage.PrettifyPage()
        df = open('daraz_pretty.html', 'w')
    result = requests.get(url)
    if result.status_code == 200:
        src = result.content
        soup = BeautifulSoup(src, 'lxml')
        if debug:
            df.write(pp.prettify(soup, 2))
    else:
        print(f"Unable to load: {url}")

if __name__ == '__main__':
    fetch_url('https://www.daraz.pk/', debug=True)
Posts: 8
Threads: 2
Joined: Oct 2019
(Oct-18-2019, 01:53 AM)Larz60+ Wrote: the only h2 tags on this page are: ...
First of all, why didn't my code work?
Secondly, I saved the file PrettifyPage.py and then ran the second code you wrote through the Python IDLE. It's giving me this error:
Traceback (most recent call last):
File "C:\Users\SoftLand PC\Desktop\python2.py", line 20, in <module>
fetch_url('https://www.daraz.pk/', debug=True)
File "C:\Users\SoftLand PC\Desktop\python2.py", line 15, in fetch_url
df.write(pp.prettify(soup, 2))
File "C:\Users\SoftLand PC\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 369: character maps to <undefined>
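That UnicodeEncodeError is not from the scraping itself: on Windows, open('daraz_pretty.html', 'w') uses the locale's default cp1252 codec, which cannot represent characters such as '\u2713' (a check mark) that appear in the page. Passing encoding='utf-8' to open() fixes it. A minimal sketch (the file is written to a temp dir so it runs anywhere):

```python
import os
import tempfile

# '\u2713' exists in UTF-8 but not in Windows cp1252.
text = 'status: \u2713'

# Reproduce the failure without touching the default locale:
try:
    text.encode('cp1252')
except UnicodeEncodeError as err:
    print('cp1252 fails:', err.reason)

# The fix for the fetch_url() script is the same one line:
#   df = open('daraz_pretty.html', 'w', encoding='utf-8')
path = os.path.join(tempfile.mkdtemp(), 'daraz_pretty.html')
with open(path, 'w', encoding='utf-8') as df:
    df.write(text)

with open(path, encoding='utf-8') as f:
    print(f.read())
```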
Posts: 7,320
Threads: 123
Joined: Sep 2016
(Oct-18-2019, 11:44 AM)tahir1990 Wrote: First of all, why didn't my code work? It will never work because of JavaScript; just turn off JavaScript in the browser and see what you get.
You need to use other tools like Selenium.
A quick demo that gets the links under Flash Sale. You also need to join root-url + href to use the links, as you only get the href back.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
#--| Setup
options = Options()
#options.add_argument("--headless")
browser = webdriver.Chrome(executable_path=r'chromedriver.exe', options=options)
#--| Parse or automation
browser.get('https://www.daraz.pk/')
browser.implicitly_wait(3)
soup = BeautifulSoup(browser.page_source, 'lxml')
flash_sale = soup.find('div', class_="card-fs-content-body J_FSBody")
for link in flash_sale.find_all('a'):
    print(link['href'])
Output: //www.daraz.pk/products/magnetic-bluetooth-wireless-stereo-in-ear-sports-handfree-bluetooth-handfree-handsfree-i117746320-s1270732052.html?search=1&mp=1&c=fs
//www.daraz.pk/products/flawless-women-painless-hair-remover-face-facial-hair-remover-i119164615-s1272776495.html?search=1&mp=1&c=fs
//www.daraz.pk/products/brand-new-in-ear-woofer-headphones-super-basser-multicolor-so-i123712768-s1280896199.html?search=1&mp=1&c=fs
//www.daraz.pk/products/windows-10-pro-2019-version-1903-fully-updated-64-bit-dvd-i114684527-s1266482147.html?search=1&mp=1&c=fs
//www.daraz.pk/products/high-quality-tummy-trimmer-black-and-silver-exercise-machine-i118616642-s1271900650.html?search=1&mp=1&c=fs
//www.daraz.pk/products/genuine-leather-card-holder-i121356181-s1276938197.html?search=1&mp=1&c=fs
More about it here Web-scraping part-2.
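Since the hrefs come back protocol-relative (starting with //), urllib.parse.urljoin against the root URL gives usable absolute links. A sketch using one of the hrefs printed above:

```python
from urllib.parse import urljoin

root = 'https://www.daraz.pk/'
href = '//www.daraz.pk/products/genuine-leather-card-holder-i121356181-s1276938197.html?search=1&mp=1&c=fs'

# urljoin takes the scheme from the root and everything else from the href
full_url = urljoin(root, href)
print(full_url)
```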
Posts: 8
Threads: 2
Joined: Oct 2019
(Oct-18-2019, 12:18 PM)snippsat Wrote: It will never work because of JavaScript; just turn off JavaScript in the browser and see what you get. ...
Oh, got it. Thanks a lot, buddy.
Sometimes HTML and CSS get complex. The YouTube tutorials I learned from used simple web-page examples, and I failed to realize that the online shopping website I am trying to learn web scraping on is not a simple web page. I think I should first start with simple-looking web pages, right?
Posts: 7,320
Threads: 123
Joined: Sep 2016
Oct-18-2019, 01:21 PM
(This post was last modified: Oct-18-2019, 01:21 PM by snippsat.)
(Oct-18-2019, 12:48 PM)tahir1990 Wrote: I think I should first start with simple-looking web pages, right? It's okay to get the basic understanding first and train with different pages (or just raw HTML);
as you see, I can scrape that page of yours with not so much code.
Can look at this: Web-Scraping part-1.
When I say raw HTML, it's just HTML you write yourself or copy in.
from bs4 import BeautifulSoup
# Simulate a web page
html = '''\
<html>
<head>
<title>My Site</title>
</head>
<body>
<title>First chapter</title>
<p>Page1</p>
<p>Page2</p>
</body>
</html>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title'))
print(soup.find_all('title'))
Output: <title>My Site</title>
[<title>My Site</title>, <title>First chapter</title>]
# Iterative testing
>>> soup.find_all('p')
[<p>Page1</p>, <p>Page2</p>]
>>>
>>> # Using CSS selector
>>> soup.select('body > p')
[<p>Page1</p>, <p>Page2</p>]
>>> soup.select_one('body > p').text
'Page1'