Posts: 8
Threads: 2
Joined: Oct 2019
Oct-17-2019, 05:38 PM
(This post was last modified: Oct-18-2019, 01:26 AM by Larz60+.)
Hello
A beginner in web scraping and Python here.
I've been trying to scrape a website using Python and Beautiful Soup.
I ran the code below from cmd, but it's not fetching the results, i.e. the h2 tags.
Please inspect the code and let me know if I made any mistake. Thanks.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.daraz.pk/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for h2_tag in soup.find_all("h2"):
    a_tag = h2_tag.find('a')
    urls.append(a_tag.attrs['href'])
print(urls)
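One thing to watch for in that loop: find('a') returns None when an h2 has no a child, and a_tag.attrs would then raise AttributeError. A defensive sketch of the same loop, run against inline HTML so it works offline:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched page, so this runs without network access.
html = '<h2><a href="/item1">One</a></h2><h2>No link inside</h2>'
soup = BeautifulSoup(html, 'html.parser')

urls = []
for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    if a_tag is not None:  # skip h2 tags with no <a> child
        urls.append(a_tag['href'])
print(urls)  # ['/item1']
```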
Posts: 12,031
Threads: 485
Joined: Sep 2016
The only h2 tags on this page are:
Output: <div class="drz-footer-about">
<h1 class="drz-footer-title">
<h2 style="font-size: 14px; font-weight: 400; line-height: 20px; margin: 0 0 10px; color: #606060;">
What Makes Us Different from Other Online Shopping Platforms?
</h2>
</h1>
</div>
This code will help you see where your links are located.
The following module reformats the HTML so it is easier to view.
Save it as PrettifyPage.py (use this exact name, as it will be imported):
# PrettifyPage.py
from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line

if __name__ == '__main__':
    pp = PrettifyPage()
This will fetch the page:
import requests
from bs4 import BeautifulSoup

def fetch_url(url, debug=False):
    if debug:
        import PrettifyPage
        pp = PrettifyPage.PrettifyPage()
        df = open('daraz_pretty.html', 'w')
    result = requests.get(url)
    if result.status_code == 200:
        src = result.content
        soup = BeautifulSoup(src, 'lxml')
        if debug:
            df.write(pp.prettify(soup, 2))
    else:
        print(f"Unable to load: {url}")

if __name__ == '__main__':
    fetch_url('https://www.daraz.pk/', debug=True)
Posts: 8
Threads: 2
Joined: Oct 2019
(Oct-18-2019, 01:53 AM)Larz60+ Wrote: the only h2 tags on this page are: ...
First of all, why didn't my code work?
Secondly, I saved the file PrettifyPage.py and then ran the second code you wrote through the Python IDLE. It's giving me this error:
Traceback (most recent call last):
File "C:\Users\SoftLand PC\Desktop\python2.py", line 20, in <module>
fetch_url('https://www.daraz.pk/', debug=True)
File "C:\Users\SoftLand PC\Desktop\python2.py", line 15, in fetch_url
df.write(pp.prettify(soup, 2))
File "C:\Users\SoftLand PC\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 369: character maps to <undefined>
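That UnicodeEncodeError is not from the scraping itself: on Windows, open('daraz_pretty.html', 'w') uses the locale's default cp1252 codec, which cannot represent characters such as '\u2713' (a check mark) that appear in the page. Passing encoding='utf-8' to open() fixes it. A minimal sketch (the file is written to a temp dir so it runs anywhere):

```python
import os
import tempfile

# '\u2713' exists in UTF-8 but not in Windows cp1252.
text = 'status: \u2713'

# Reproduce the failure without touching the default locale:
try:
    text.encode('cp1252')
except UnicodeEncodeError as err:
    print('cp1252 fails:', err.reason)

# The fix for the fetch_url() script is the same one line:
#   df = open('daraz_pretty.html', 'w', encoding='utf-8')
path = os.path.join(tempfile.mkdtemp(), 'daraz_pretty.html')
with open(path, 'w', encoding='utf-8') as df:
    df.write(text)

with open(path, encoding='utf-8') as f:
    print(f.read())
```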
Posts: 7,320
Threads: 123
Joined: Sep 2016
(Oct-18-2019, 11:44 AM)tahir1990 Wrote: First of all, why didn't my code work? It will never work because of JavaScript; just turn off JavaScript in the browser and see what you get.
You need to use other tools like Selenium.
A quick demo that gets the links under Flash Sale. You also need to join root-url + href to use the links, as you only get the href back.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
#--| Setup
options = Options()
#options.add_argument("--headless")
browser = webdriver.Chrome(executable_path=r'chromedriver.exe', options=options)
#--| Parse or automation
browser.get('https://www.daraz.pk/')
browser.implicitly_wait(3)
soup = BeautifulSoup(browser.page_source, 'lxml')
flash_sale = soup.find('div', class_="card-fs-content-body J_FSBody")
for link in flash_sale.find_all('a'):
    print(link['href'])
Output: //www.daraz.pk/products/magnetic-bluetooth-wireless-stereo-in-ear-sports-handfree-bluetooth-handfree-handsfree-i117746320-s1270732052.html?search=1&mp=1&c=fs
//www.daraz.pk/products/flawless-women-painless-hair-remover-face-facial-hair-remover-i119164615-s1272776495.html?search=1&mp=1&c=fs
//www.daraz.pk/products/brand-new-in-ear-woofer-headphones-super-basser-multicolor-so-i123712768-s1280896199.html?search=1&mp=1&c=fs
//www.daraz.pk/products/windows-10-pro-2019-version-1903-fully-updated-64-bit-dvd-i114684527-s1266482147.html?search=1&mp=1&c=fs
//www.daraz.pk/products/high-quality-tummy-trimmer-black-and-silver-exercise-machine-i118616642-s1271900650.html?search=1&mp=1&c=fs
//www.daraz.pk/products/genuine-leather-card-holder-i121356181-s1276938197.html?search=1&mp=1&c=fs
More about it here Web-scraping part-2.
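Since the hrefs come back protocol-relative (starting with //), urllib.parse.urljoin against the root URL gives usable absolute links. A sketch using one of the hrefs printed above:

```python
from urllib.parse import urljoin

root = 'https://www.daraz.pk/'
href = '//www.daraz.pk/products/genuine-leather-card-holder-i121356181-s1276938197.html?search=1&mp=1&c=fs'

# urljoin takes the scheme from the root and everything else from the href
full_url = urljoin(root, href)
print(full_url)
```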
Posts: 8
Threads: 2
Joined: Oct 2019
(Oct-18-2019, 12:18 PM)snippsat Wrote: It will never work because of JavaScript; just turn off JavaScript in the browser and see what you get. ...
Oh, got it. Thanks a lot, buddy.
Sometimes HTML and CSS get complex. The YouTube tutorials I learned from used simple web-page examples, and I failed to realize that the online shopping website I am trying to learn web scraping on is not a simple web page. I think I should first start with simple-looking web pages, right?
Posts: 7,320
Threads: 123
Joined: Sep 2016
Oct-18-2019, 01:21 PM
(This post was last modified: Oct-18-2019, 01:21 PM by snippsat.)
(Oct-18-2019, 12:48 PM)tahir1990 Wrote: I think I should first start with simple-looking web pages, right? It's okay to get the basic understanding first and train with different pages (or just raw HTML);
as you see, I can scrape that page of yours with not so much code.
Can look at this: Web-Scraping part-1.
When I say raw HTML, it's just HTML you write yourself or copy in.
from bs4 import BeautifulSoup
# Simulate a web page
html = '''\
<html>
<head>
<title>My Site</title>
</head>
<body>
<title>First chapter</title>
<p>Page1</p>
<p>Page2</p>
</body>
</html>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title'))
print(soup.find_all('title'))
Output: <title>My Site</title>
[<title>My Site</title>, <title>First chapter</title>]
# Iterative testing
>>> soup.find_all('p')
[<p>Page1</p>, <p>Page2</p>]
>>>
>>> # Using CSS selector
>>> soup.select('body > p')
[<p>Page1</p>, <p>Page2</p>]
>>> soup.select_one('body > p').text
'Page1'