Beautifulsoup don't get me the page (Python Forum, Web Scraping & Web Development)
Beautifulsoup don't get me the page - mariolopes - Oct-20-2019

Hi. I use this code:

    import requests
    from bs4 import BeautifulSoup

    pagina1 = "https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html"
    pagina1 = requests.get(pagina1, headers={'User-agent': 'your bot 0.1'})
    soup = BeautifulSoup(pagina1.content, "html.parser")
    print(soup)

The result is not the source code of the page but something like:

    <!DOCTYPE html>
    <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
    <!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
    <!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
    <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
    <head>
    <title>Attention Required! | Cloudflare</title>
    <meta id="captcha-bypass" name="captcha-bypass"/>

and a few more lines. What happened? Why can't I get the source code with BeautifulSoup? Thank you.

RE: Beautifulsoup don't get me the page - Larz60+ - Oct-20-2019

Check the status to make sure the page has been downloaded.

    # pagina1 is the URL string here, not the response object
    response = requests.get(pagina1)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "lxml")
    else:
        print(f"unable to fetch page: {pagina1}")

RE: Beautifulsoup don't get me the page - metulburr - Oct-20-2019

Based on the HTML I got with your code, it looks like you are getting a CAPTCHA:

    <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
    <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>

RE: Beautifulsoup don't get me the page - mariolopes - Oct-22-2019

If you follow the link in a browser, there is no CAPTCHA on this website. I think the problem is with the User-Agent: for some reason the website detects that the request is not from a browser. My issue would be solved if I could read the source code of the page with Python, but I don't know how to read it and get values from it in Python.
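The replies so far come down to one check: make sure the response is the real page before parsing it. Below is a minimal sketch of that check, assuming the block page can be recognised either by a non-200 status or by the markers quoted in the first post (the "Attention Required!" title and the `captcha-bypass` meta tag). The names `page_title`, `looks_blocked` and `BLOCK_MARKERS` are hypothetical, and other challenge pages may well need different markers. It uses only the standard library, so it can run before any BeautifulSoup parsing.

```python
import re

# Markers taken from the block page quoted in this thread.
# Assumption: other challenge pages may use different wording.
BLOCK_MARKERS = ("Attention Required!", "captcha-bypass")

# Matches the first <title>...</title> pair, across line breaks.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html: str) -> str:
    """Return the text of the first <title> element, or '' if there is none."""
    match = _TITLE_RE.search(html)
    return match.group(1).strip() if match else ""


def looks_blocked(status_code: int, html: str) -> bool:
    """Heuristic: True when the response is probably a challenge page."""
    if status_code != 200:
        return True
    # A 200 response can still carry the challenge markup.
    return any(marker in html for marker in BLOCK_MARKERS)
```

With requests, this would be called as `looks_blocked(response.status_code, response.text)` before handing `response.content` to BeautifulSoup. It is only a heuristic: a genuine page that happens to contain one of the marker strings would be flagged too.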
RE: Beautifulsoup don't get me the page - snippsat - Oct-22-2019

The first step is to try the User-Agent that a real browser sends to this site:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36

That does not work; I tested it. The next step is to use Selenium.

    from selenium import webdriver
    from bs4 import BeautifulSoup
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys
    import time

    #--| Setup
    options = Options()
    #options.add_argument("--headless")
    browser = webdriver.Chrome(executable_path=r'chromedriver.exe', options=options)
    # Set the implicit wait (used for element lookups) before loading the page
    browser.implicitly_wait(5)
    #--| Parse or automation
    browser.get('https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html')
    soup = BeautifulSoup(browser.page_source, 'lxml')
    parfum = soup.select('#col1 > div > div > h1 > span')

Now it works. Here, for example, I use a CSS selector to get the perfume title. The text would be:

    >>> parfum
    [<span itemprop="name">Coco Eau de Parfum Chanel for women</span>]
    >>> parfum[0].text
    'Coco Eau de Parfum Chanel for women'

RE: Beautifulsoup don't get me the page - mariolopes - Oct-23-2019

Thank you. Selenium seems to be the only option for this website.
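For readers on Selenium 4, where the `executable_path` argument used in the Selenium reply was deprecated and then removed in later 4.x releases, a rough equivalent might look like the following. This is a sketch under assumptions, not a tested recipe for this site. It assumes Chrome is installed and Selenium can locate a matching driver. It reuses the CSS selector from that reply, although the page layout may have changed since 2019. The function name `fetch_title` and the 10-second timeout are illustrative choices. It waits explicitly for the element instead of relying on `implicitly_wait`, and uses `html.parser` so that lxml is not required.

```python
URL = "https://www.fragrantica.com/perfume/Chanel/Coco-Eau-de-Parfum-609.html"
TITLE_SELECTOR = "#col1 > div > div > h1 > span"  # selector from the thread


def fetch_title(url: str = URL, selector: str = TITLE_SELECTOR,
                timeout: float = 10.0) -> str:
    """Load url in Chrome; return the text of the first element matching selector."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()  # assumes a Chrome driver can be found
    try:
        browser.get(url)
        # Block until the element is present; raises TimeoutException otherwise.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        soup = BeautifulSoup(browser.page_source, "html.parser")
        return soup.select_one(selector).get_text(strip=True)
    finally:
        browser.quit()  # always close the browser, even on errors
```

Calling `fetch_title()` with no arguments opens the page from this thread. Passing a different `url` and `selector` reuses the same wait-then-parse pattern for another page.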