Python Forum

Full Version: Beautiful Soup (suddenly) doesn't get full webpage html
Hello all,
A few months ago I dabbled in Beautiful Soup for the first time, so I still lack much understanding of the module and the subject in general.
Back then, Beautiful Soup parsed the whole page HTML just fine. But this time, re-running the same code, I only get a partial HTML response, most of it being lines of JavaScript.

The code is:
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
# print(soup)
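A quick way to see what kind of response came back (a hypothetical check, not part of the original code) is to count the script tags versus the rendered video links in whatever HTML was returned:

```python
from bs4 import BeautifulSoup

def summarize_response(html: str) -> dict:
    """Count script tags vs rendered video links in a results page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "scripts": len(soup.find_all("script")),
        "video_links": len(soup.find_all("a", id="video-title")),
    }

# A JS-rendered YouTube response shows many scripts and zero video links.
sample = '<html><script>var x;</script><a id="video-title">t</a></html>'
print(summarize_response(sample))
```

Running this on the real `response.text` makes it obvious whether the result links exist in the HTML at all or are only built later by JavaScript.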
Any tip will be much appreciated.
Thanks and best regards,
JC
Hey bro,

So I would recommend using the code below: it returns the full request. It's hard for me to know what is missing compared to the original you ran months ago, though.

The query below works nicely as far as I am aware.

import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.youtube.com/results?search_query=python").content
soup = BeautifulSoup(url, 'lxml') # You can use html.parser here alternatively - Depends on what you are wanting to achieve
print(soup)
j.crater Wrote:most of the response being lines of JavaScript.
Look at Web-scraping part-2 under:
snippsat Wrote:JavaScript, why do I not get all content

So, to give a demo, here is how to parse using both BS and Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(service=Service(r'C:\cmder\bin\chromedriver.exe'), options=options)
#--| Parse or automation
url = "https://www.youtube.com/results?search_query=python"
browser.get(url)
time.sleep(2)

# Use BS to parse
soup = BeautifulSoup(browser.page_source, 'lxml')
first_title = soup.find('a', id="video-title")
print(first_title.text.strip())

print('-' * 50)
# Use Selenium to parse
second_title_sel = browser.find_elements(By.XPATH, '//*[@id="video-title"]')
print(second_title_sel[1].text)
Output:
Learn Python - Full Course for Beginners [Tutorial]
--------------------------------------------------
Python Tutorial - Python for Beginners [Full Course]
YouTube also has an API, the YouTube Data API, which can be used from Python.
See this post for an example.
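As a sketch of what using the API looks like (the endpoint and parameter names follow the public YouTube Data API v3 `search` method; the key is a placeholder you would get from Google Cloud):

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

def build_search_url(query: str, api_key: str, max_results: int = 5) -> str:
    """Build a YouTube Data API v3 search URL that returns JSON (no HTML scraping needed)."""
    params = {
        "part": "snippet",       # include title, channel, etc. in the response
        "q": query,              # the search term
        "type": "video",         # restrict results to videos
        "maxResults": max_results,
        "key": api_key,          # placeholder; supply your own API key
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"

print(build_search_url("python", "YOUR_API_KEY"))
```

Fetching that URL with Requests returns structured JSON, which sidesteps the JavaScript-rendering problem entirely.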
Thank you both for the answers.

@HarleyQuin
The code I ran months ago was the same as I posted here, but the result was not the same. As stated, on my first attempt I got all the HTML contents, while this time I didn't. Also, replacing the parser with the lxml parser didn't make a difference. Do you have any idea, from experience, why such a difference?

@snippsat
Your code returns all the HTML contents of the page if I print the soup. Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML? However, to reiterate, my original run of the code returned the complete HTML contents. Could it be that the website's rendering just got slower for some reason since my last attempt at parsing (a few months ago)?
(Jul-11-2020, 11:28 AM)j.crater Wrote: Your code returns all the HTML contents of the page if I print the soup. Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML? However,
The 2-second sleep has nothing to do with this; it's just there for safety (to make sure the whole page has loaded). You can comment it out and it still works.
It's Selenium that's the important part here.
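The pattern behind replacing a fixed sleep with an explicit wait can be sketched in plain Python (Selenium's own `WebDriverWait` does essentially this polling loop; this standalone helper just illustrates the idea):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.25):
    """Poll `condition` until it returns a truthy value, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result          # return as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(poll)           # back off briefly before the next check

# e.g. with a live browser object (hypothetical usage):
# wait_for(lambda: 'video-title' in browser.page_source, timeout=10)
```

The advantage over a fixed `time.sleep(2)` is that you wait exactly as long as needed and no longer, and you get a clear error if the page never finishes rendering.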
In link Web-scraping part-2.
snippsat Wrote:JavaScript is used all over the web because of its unique position to run in the browser (client side).
This can make parsing more difficult,
because Requests/bs4/lxml cannot get anything that's executed/rendered by JavaScript.

There are ways to overcome this; we're going to use Selenium.

When you just parse with Requests and BS, you will not get the executed JavaScript, only the raw content.
Then you will not find, for example, this tag: soup.find('a', id="video-title"),
because you are getting raw JavaScript back.
It will be in a script tag; here is a cleaned-up version (with a lot deleted) to get to where the title is.
<script>
    window["ytInitialData"] .... = "title":{"runs":[{"text":"Learn Python - Full Course for Beginners [Tutorial]"}],"accessibility":{"accessibilityData":{"label":"Learn Python "viewCountText":{"simpleText":"Sett 16 184 859 ganger"},.....
    window["ytInitialPlayerResponse"] = null;
    if (window.ytcsi) {window.ytcsi.tick("pdr", null, '');}
</script>
Parsing this raw JavaScript is almost impossible; that's why we use Selenium to get the executed JavaScript back.
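To see just how brittle it gets, here is a hypothetical attempt to pull only the title out of a raw script string shaped like the one above, using a regex:

```python
import re

# A trimmed-down stand-in for the raw script content shown above.
raw = ('window["ytInitialData"] = {"title":{"runs":[{"text":'
       '"Learn Python - Full Course for Beginners [Tutorial]"}]}};')

# Match the title text nested inside the "runs" structure.
match = re.search(r'"title":\{"runs":\[\{"text":"(.*?)"\}', raw)
print(match.group(1) if match else "not found")
```

Any change to the JSON shape breaks the regex, which is why executing the page in a real browser (Selenium) is the saner route.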
(Jul-11-2020, 11:52 AM)j.crater Wrote: Thank you both for the answers.

@HarleyQuin
The code I ran months ago was the same as I posted here, but the result was not the same. As stated, on my first attempt I got all the HTML contents, while this time I didn't. Also, replacing the parser with the lxml parser didn't make a difference. Do you have any idea, from experience, why such a difference?

Hey again,

From experience I have noticed that not using a user-agent/header makes it very easy for YouTube to immediately identify you as a web scraper and handle your request differently from how a conventional user would be welcomed by the site. That is something that made a difference when I first started scraping.

e.g. I use this in my code:


import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    "Content-Type": "application/x-www-form-urlencoded"}

url = "https://whatsmyua.info/"

webpage = requests.get(url, headers=headers).text
print(webpage)
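To avoid repeating the headers in every call, one option (a hypothetical helper, not from the post above) is to merge a default set with per-request extras:

```python
DEFAULT_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"),
    "Content-Type": "application/x-www-form-urlencoded",
}

def build_headers(extra=None):
    """Return a copy of the default headers, optionally overridden per request."""
    headers = dict(DEFAULT_HEADERS)  # copy so the defaults are never mutated
    if extra:
        headers.update(extra)
    return headers

# Usage with Requests (hypothetical):
# requests.get(url, headers=build_headers({"Accept-Language": "en-US"}))
```

A `requests.Session` with `session.headers.update(DEFAULT_HEADERS)` achieves the same thing while also reusing the connection.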

Sorry if I have been of no use!

I hope you solve your issue buddy,

Regards,

Harley
@HarleyQuin
This is a very clever approach; I will probably be using preset headers from now on. I would probably never even have considered the effects websites might have on "robot" users.

@snippsat
Your code works well and I can definitely continue from here. Frankly, I have no idea what was different on my attempt this time, since using requests.get() and B.S. gave good results on my first try.
Given your examples with B.S. and Selenium, can Selenium replace B.S. entirely for use with scraping/navigating websites? In that case, I will stick to Selenium down the road, to avoid overhead and invest in learning one tool well instead.
(Jul-11-2020, 02:43 PM)j.crater Wrote: Your code works well and I can definitely continue from here. Frankly, I have no idea what was different on my attempt this time, since using requests.get() and B.S. gave good results on my first try.
They may have changed the source, so now almost all content is generated by JavaScript.
(Jul-11-2020, 02:43 PM)j.crater Wrote: Given your examples with B.S. and Selenium, can Selenium replace B.S. entirely for use with scraping/navigating websites?
You use Selenium only when it's necessary and you cannot get the content using Requests/BS alone.
This is usually the case with heavy sites, e.g., to pick an example, stock exchange sites, which we have many threads about.

To better understand what the JavaScript DOM (Document Object Model) does in the browser,
use this address as before:
https://www.youtube.com/results?search_query=python
Now turn off JavaScript in the browser, then reload the page. What do you see now?
Quote:They may have changed the source, so now almost all content is generated by JavaScript.
This is most likely the case indeed.

Quote:Now turn off JavaScript in the browser, then reload the page. What do you see now?
And this seems to prove it. By disabling JavaScript and then checking the page source, I see exactly the results I got from B.S.

Thanks a lot for help and tips.