(Jul-11-2020, 11:28 AM)j.crater Wrote: Your code returns all the HTML contents of the page, if I print the soup. Is the main factor here the 2-second sleep, which allows the JavaScript to execute completely before parsing the HTML?
No, the 2-second sleep has nothing to do with this; it's just there for safety (to make sure the whole page has loaded). You can comment it out and it will still work.
It's Selenium that's the important part here.
From the link Web-scraping part-2:
snippsat Wrote: JavaScript is used all over the web because of its unique position: it runs in the browser (client side).
This can make parsing more difficult,
because Requests/bs4/lxml cannot see what is executed/rendered by JavaScript.
There are ways to overcome this; we're going to use Selenium.
When you just parse with Requests and BS, you will not get the executed JavaScript, only the raw content.
Then you will not find, for example, this tag:
soup.find('a', id="video-title")
because you are getting raw JavaScript back.
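A minimal sketch of what happens, using a short hypothetical snippet of raw, unexecuted page source instead of a live YouTube page: BeautifulSoup only sees the <script> payload, so the lookup comes back empty.

```python
from bs4 import BeautifulSoup

# Hypothetical raw page source: the <a id="video-title"> link would only
# exist after the JavaScript in the <script> tag has run in a browser.
raw_html = """
<html><body>
<script>window["ytInitialData"] = {"title": "Learn Python - Full Course for Beginners [Tutorial]"};</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
tag = soup.find('a', id="video-title")
print(tag)  # None -- the tag is never present in the raw content
```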
It will be in a script tag. Here is a cleaned-up version (with a lot deleted) showing where the title is:
<script> window["ytInitialData"] .... = "title":{"runs":[{"text":"Learn Python - Full Course for Beginners [Tutorial]"}],"accessibility":{"accessibilityData":{"label":"Learn Python "viewCountText":{"simpleText":"Sett 16 184 859 ganger"},..... window["ytInitialPlayerResponse"] = null; if (window.ytcsi) {window.ytcsi.tick("pdr", null, '');} </script>
Parsing this raw JavaScript is almost impossible; that's why we use Selenium to get the executed JavaScript back.
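If you really had to, you could try to dig the title out of the raw blob with a regular expression, but this is fragile and breaks whenever YouTube changes the key layout, which is exactly why Selenium is the better tool here. A rough sketch, using a shortened hypothetical fragment of the payload above:

```python
import json
import re

# Hypothetical, shortened fragment of the raw <script> payload shown above.
raw_js = '"title":{"runs":[{"text":"Learn Python - Full Course for Beginners [Tutorial]"}]}'

# Fragile: depends on the exact key layout YouTube happens to use today.
match = re.search(r'"title":\{"runs":\[\{"text":("(?:[^"\\]|\\.)*")', raw_js)
title = json.loads(match.group(1)) if match else None
print(title)  # Learn Python - Full Course for Beginners [Tutorial]
```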