Beautiful Soup (suddenly) doesn't get full webpage html
#1
Hello all,
A few months ago I dabbled in Beautiful Soup for the first time, so I still lack much understanding of the module and the subject in general.
Back then, Beautiful Soup parsed the whole page HTML just fine. This time, re-running the same code, I only get a partial HTML response, most of it being lines of JavaScript.

The code is:
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
# print(soup)
Any tip will be much appreciated.
Thanks and best regards,
JC
#2
Hey bro,

I would recommend using the code I used below; it returns the full request. It's hard for me to know what it is missing compared to the original you ran months ago, though...

The query below works nicely as far as I am aware.

import requests
from bs4 import BeautifulSoup

content = requests.get("https://www.youtube.com/results?search_query=python").content
soup = BeautifulSoup(content, 'lxml')  # html.parser also works here - depends on what you want to achieve
print(soup)
#3
j.crater Wrote: most of the response being lines of JavaScript.
Look at Web-scraping part-2, under:
snippsat Wrote: JavaScript, why do I not get all content

So, to give a demo of using both BS and Selenium to parse:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.youtube.com/results?search_query=python"
browser.get(url)
time.sleep(2)

# Use Bs to Parse
soup = BeautifulSoup(browser.page_source, 'lxml')
first_title = soup.find('a', id="video-title")
print(first_title.text.strip())

print('-' * 50)
# Use Selenium to parse
second_title_sel = browser.find_elements_by_xpath('//*[@id="video-title"]')
print(second_title_sel[1].text)
Output:
Learn Python - Full Course for Beginners [Tutorial]
--------------------------------------------------
Python Tutorial - Python for Beginners [Full Course]
YouTube also has an API, the YouTube Data API, which can be used from Python.
Example in this post.
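A minimal sketch of a search call to that API with Requests; YOUR_API_KEY is a placeholder, you need your own key from the Google Cloud Console for it to run:

import requests

# Ask the YouTube Data API (v3) for videos matching a query.
params = {
    "part": "snippet",
    "q": "python",
    "type": "video",
    "maxResults": 5,
    "key": "YOUR_API_KEY",  # placeholder - replace with a real API key
}
r = requests.get("https://www.googleapis.com/youtube/v3/search", params=params)
for item in r.json().get("items", []):
    print(item["snippet"]["title"])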
#4
Thank you both for the answers.

@HarleyQuin
The code I ran months ago was the same as I posted here, but the result was not the same. As stated, on my first attempt I got all the HTML contents, while this time I didn't. Also, replacing the parser with lxml didn't make a difference. Do you have any idea, from experience, why such a difference?

@snippsat
Your code returns all the HTML contents of the page if I print the soup. Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML? However, to reiterate, my original run of the code returned the complete HTML contents. Could it be that the website's rendering just got slower for some reason since my last attempt at parsing (a few months ago)?
#5
(Jul-11-2020, 11:28 AM)j.crater Wrote: Your code returns all the HTML contents of the page if I print the soup. Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML?
The 2-second sleep has nothing to do with this; it's just there for safety (to make sure the whole page has loaded). You can comment it out and it still works.
It's Selenium that's the important part here.
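If you want something more robust than a fixed sleep, Selenium's explicit waits can block until the element actually exists. A minimal sketch, reusing the browser object from the code in post #3:

# A more robust alternative to time.sleep(): wait (up to 10 seconds)
# until the element is present in the DOM, then read it.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
first_video = wait.until(EC.presence_of_element_located((By.ID, "video-title")))
print(first_video.text)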
In the link Web-scraping part-2:
snippsat Wrote: JavaScript is used all over the web because of its unique position to run in the browser (client side).
This can make it more difficult to do parsing,
because Requests/bs4/lxml cannot get all that is executed/rendered by JavaScript.

There are ways to overcome this; we're gonna use Selenium.

When you just parse with Requests and BS, you will not get the executed JavaScript, only the raw content.
Then you will not find, for example, this tag: soup.find('a', id="video-title"),
because you are getting raw JavaScript back.
The title will be in a script tag; here is a cleaned-up version (a lot deleted) to get to where the title is.
<script>
    window["ytInitialData"] .... = "title":{"runs":[{"text":"Learn Python - Full Course for Beginners [Tutorial]"}],"accessibility":{"accessibilityData":{"label":"Learn Python "viewCountText":{"simpleText":"Sett 16 184 859 ganger"},.....
    window["ytInitialPlayerResponse"] = null;
    if (window.ytcsi) {window.ytcsi.tick("pdr", null, '');}
</script>
Parsing this raw JavaScript is almost impossible; that's why we use Selenium to get the executed JavaScript back.
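You can check this yourself without a browser; a minimal sketch showing that the rendered tag is simply absent from the raw HTML:

import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/results?search_query=python"
soup = BeautifulSoup(requests.get(url).content, 'lxml')
# The rendered tag does not exist in the raw response...
print(soup.find('a', id="video-title"))  # --> None
# ...but the page is full of script tags carrying the data as JavaScript.
print(len(soup.find_all('script')))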
#6
(Jul-11-2020, 11:52 AM)j.crater Wrote: Thank you both for the answers.

@HarleyQuin
The code I ran months ago was the same as I posted here, but the result was not the same. As stated, on my first attempt I got all the HTML contents, while this time I didn't. Also, replacing the parser with lxml didn't make a difference. Do you have any idea, from experience, why such a difference?

Hey again,

From experience, I have noticed that not using a user-agent/header makes it very easy for YouTube to immediately identify you as a web scraper and deal with your connection differently from how a conventional user would be welcomed by the site. That is something that made a difference when I first started scraping.

e.g. I use this in my code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    "Content-Type": "application/x-www-form-urlencoded"}

url = "https://whatsmyua.info/"

webpage = requests.get(url, headers=headers).text
print(webpage)
Sorry if I have been of no use!

I hope you solve your issue buddy,

Regards,

Harley
#7
@HarleyQuin
This is a very clever approach; I will probably be using preset headers from now on. I would never even have considered the effects websites might have on "robot" users.

@snippsat
Your code works well and I can definitely continue from here. Frankly, I have no idea what was different on my attempt this time, since using requests.get() and B.S. gave good results on my first try.
Given your examples with B.S. and Selenium, can Selenium replace B.S. entirely for use with scraping/navigating websites? In that case, I will stick to Selenium down the road, to avoid overhead and invest in learning one tool well instead.
#8
(Jul-11-2020, 02:43 PM)j.crater Wrote: Your code works well and I can definitely continue from here. Frankly, I have no idea what was different on my attempt this time, since using requests.get() and B.S. gave good results on my first try.
They may have changed the source, so now almost all the code is generated by JavaScript.
(Jul-11-2020, 02:43 PM)j.crater Wrote: Given your examples with B.S. and Selenium, can Selenium replace B.S. entirely for use with scraping/navigating websites?
You use Selenium only when it's necessary and you cannot get the content with Requests/BS alone.
This is usually the case with heavy sites, e.g., to pick an example, stock exchange sites, which we have many threads about.
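A rough sketch of that decision pattern, reusing the video-title tag from above (try the cheap Requests/BS route first, and only start a browser when the tag is missing; assumes chromedriver is on PATH):

import requests
from bs4 import BeautifulSoup

def get_titles(url):
    # Cheap attempt first: plain Requests + BS on the raw HTML.
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    tags = soup.find_all('a', id="video-title")
    if tags:
        return [t.text.strip() for t in tags]
    # Tag missing --> page is rendered by JavaScript, fall back to Selenium.
    from selenium import webdriver
    browser = webdriver.Chrome()
    browser.get(url)
    titles = [e.text for e in browser.find_elements_by_xpath('//*[@id="video-title"]')]
    browser.quit()
    return titles

print(get_titles("https://www.youtube.com/results?search_query=python")[:3])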

To better understand what the JavaScript DOM (Document Object Model) does in the browser,
use this address as before:
https://www.youtube.com/results?search_query=python
Now turn off JavaScript in the browser, then reload the page. What do you see now?
#9
Quote: They may have changed the source, so now almost all the code is generated by JavaScript.
This is most likely the case indeed.

Quote: Now turn off JavaScript in the browser, then reload the page. What do you see now?
And this seems to prove it. By disabling JavaScript and then checking the page source, I see the same results I got from B.S.

Thanks a lot for the help and tips.

