Python Forum
BeautifulSoup doesn't get me the page
#1
Hi. I use this code:
import requests
from bs4 import BeautifulSoup

pagina1 = "https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html"
# Send a custom User-Agent header with the request
response = requests.get(pagina1, headers={'User-agent': 'your bot 0.1'})
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
The result is not the source code of the page but something like:
<!DOCTYPE html>

<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta id="captcha-bypass" name="captcha-bypass"/>
and a few more lines.
What happened? Why can't I get the source code with BeautifulSoup?
Thank you
#2
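Check the status code to make sure the page has been downloaded.
import requests
from bs4 import BeautifulSoup

pagina1 = "https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html"
response = requests.get(pagina1)
# Only parse the body when the server reports success
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "lxml")
else:
    print(f"unable to fetch page: {pagina1} (status {response.status_code})")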
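A status check alone may not be enough if a block page ever comes back with status 200, so it can help to look at the page title as well. The sketch below is only an illustration: `is_real_page` is a hypothetical helper, and the title pattern is taken from the blocked output shown in the first post, so it is an assumption that other block pages use the same wording.

```python
import re

# Title of the block page as it appears in the output from the first post.
# Assumption: other block pages served by the site use the same title.
_BLOCK_TITLE = re.compile(
    r"<title>\s*Attention Required![^<]*Cloudflare\s*</title>",
    re.IGNORECASE,
)


def is_real_page(status_code: int, html: str) -> bool:
    """Return True only for a 200 response whose title is not the block page."""
    if status_code != 200:
        return False
    return _BLOCK_TITLE.search(html) is None


blocked_html = "<head>\n<title>Attention Required! | Cloudflare</title>\n</head>"
print(is_real_page(200, blocked_html))
print(is_real_page(200, "<head><title>Some perfume page</title></head>"))
```

With a requests response this would be called as `is_real_page(response.status_code, response.text)` before handing the content to BeautifulSoup.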
#3
Based on the HTML I got with your code, it looks like you are getting a CAPTCHA:
<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
<p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
#4
If you follow the link in a browser, there is no CAPTCHA on this website.
I think the problem is with the User-Agent. For some reason the website detects that the request is not from a browser. The issue would be solved if I could read the source code of the page with Python, but I don't know how to read the source code and get values from it in Python.
#5
The first step is to try a browser user-agent for this site:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36
That does not work; I tested it.
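For reference, sending that user-agent with requests might look like the sketch below. `BROWSER_UA` and `make_session` are hypothetical names used only for illustration, and the actual request is left commented out because, as noted, the site still blocks it.

```python
import requests

# User-agent string quoted in this post.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
)


def make_session(user_agent: str = BROWSER_UA) -> requests.Session:
    """Create a session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()
# Network call, left out here on purpose:
# response = session.get("https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html")
```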
The next step is to use Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

#--| Setup
options = Options()
#options.add_argument("--headless")
# executable_path is the Selenium 3 style; Selenium 4 takes the driver path through a Service object instead
browser = webdriver.Chrome(executable_path=r'chromedriver.exe', options=options)
# Set the implicit wait before loading the page so that it applies to any later element lookups
browser.implicitly_wait(5)
#--| Parse or automation
browser.get('https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html')
# Hand the browser-rendered HTML to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'lxml')
# CSS selector for the perfume title
parfum = soup.select('#col1 > div > div > h1 > span')
Now it works. For example, here I use a CSS selector to get the perfume's title.
The result is:
>>> parfum
[<span itemprop="name">Coco Eau de Parfum Chanel for women</span>]
>>> parfum[0].text
'Coco Eau de Parfum Chanel for women'
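Since `select` returns a list, `parfum[0]` raises an IndexError when nothing matches, for example when the block page comes back instead of the real one. A slightly more defensive variant could look like the sketch below. `extract_name` is a hypothetical helper, and selecting on `itemprop="name"` is an assumption based on the span shown above rather than something checked against the live site.

```python
from bs4 import BeautifulSoup


def extract_name(html):
    """Return the text of the first span with itemprop="name", or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('span[itemprop="name"]')
    return tag.get_text(strip=True) if tag else None


# Sample markup built from the span in the output above.
sample = '<h1><span itemprop="name">Coco Eau de Parfum Chanel for women</span></h1>'
print(extract_name(sample))
print(extract_name("<title>Attention Required! | Cloudflare</title>"))
```

With Selenium the same helper would be called as `extract_name(browser.page_source)`.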
#6
Thank you
Selenium seems to be the only option for this website.