(Dec-04-2022, 12:06 AM)Extra Wrote: How do I fix this?
You must look at the data you get back with
print(soup)
before you try to parse. Here the request is blocked, so no parsing at all will work.
Output:<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
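As a quick sanity check before parsing, you can look for that robot-check text in the raw HTML. This is just a minimal sketch; the `looks_blocked` helper and its marker strings are illustrative, not anything Amazon or requests provides:

```python
# Hypothetical helper: detect Amazon's robot-check page before parsing.
def looks_blocked(html: str) -> bool:
    markers = ("make sure you're not a robot", "captcha")
    page = html.lower()
    return any(marker in page for marker in markers)

blocked = '<p class="a-last">Sorry, we just need to make sure you\'re not a robot.</p>'
normal = '<span id="productTitle">MSI Gaming Geforce GTX 1660 Super</span>'
print(looks_blocked(blocked))  # True
print(looks_blocked(normal))   # False
```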
Can use Selenium to bypass this, but writing better headers will also fix it.
import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.ca/MSI-Geforce-192-bit-Support-Graphics/dp/B07ZHDZ1K6/ref=sr_1_16?crid=1M9LHOYX99CQW&keywords=Nvidia%2BGTX%2B1060&qid=1670109381&sprefix=nvidia%2Bgtx%2B1060%2Caps%2C79&sr=8-16&th=1'
headers = {
    'authority': 'www.amazon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
#product_title = soup.find('span', id='productTitle')
product_title = soup.select_one('#productTitle')
>>> product_title
<span class="a-size-large product-title-word-break" id="productTitle"> MSI Gaming Geforce GTX 1660 Super 192-bit HDMI/DP 6GB GDRR6 HDCP Support DirectX 12 Dual Fan VR Ready OC Graphics Card </span>
>>> print(product_title.text.strip())
MSI Gaming Geforce GTX 1660 Super 192-bit HDMI/DP 6GB GDRR6 HDCP Support DirectX 12 Dual Fan VR Ready OC Graphics Card
Also see that I use
response.content
which means bytes are passed into BeautifulSoup, so it can handle the Unicode decoding itself. Using
response.text
requests will try to decode the content itself before it is passed to BeautifulSoup.
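A small standalone illustration of why that matters (no network needed; the `Café` string is just an example of non-ASCII text, and cp1252 stands in for a wrong 8-bit encoding guess):

```python
# UTF-8 bytes as they arrive on the wire (what response.content would hold).
raw = "Café".encode("utf-8")

# If the decoder guesses a wrong 8-bit encoding (as response.text can when
# the server sends no charset), non-ASCII characters turn into mojibake.
wrong = raw.decode("cp1252")
# Decoding with the real encoding, as the parser does after sniffing the page.
right = raw.decode("utf-8")

print(wrong)  # Café
print(right)  # Café
```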