Logic behind BeautifulSoup data-parsing

jimsxxl · Apr-11-2021, 01:38 PM

Hello guys,
I'm messing around a bit with bs4, and I'm trying to parse some data from YouTube as a learning project.
What I find difficult to understand is what to look for when searching for an element to parse (for example, a video title).
What should I be looking at? What is the key to extracting the video title from the HTML code?

How should I think when I inspect an element in my browser?
Which piece of the code am I interested in?

Thank you in advance!

***snippsat*** · (This post was last modified: Apr-11-2021, 02:54 PM by snippsat.)

(Apr-11-2021, 01:38 PM)jimsxxl Wrote: I'm messing around a bit with bs4, and I'm trying to parse some data from YouTube as a learning project.
What I find difficult to understand is what to look for when searching for an element to parse (for example, a video title).

YouTube is not an easy site to scrape, and it is not something to start with.
You need Selenium, or an understanding of how a site like that operates.
There is also an API for YouTube that can return info like the title.
Look at Web-Scraping part-1, part-2.

Here is a post about API usage.
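
As a rough illustration of the original question (what to look for when inspecting an element), the sketch below parses a small, made-up HTML fragment. The markup in `SAMPLE_HTML` is an assumption that only resembles what the browser inspector shows for one video entry; the real page is much larger and its tags, ids and classes can change. The idea is to note the tag name and one distinguishing attribute, then decide whether the data sits in an attribute or in the text between the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single video entry (an assumption,
# not copied from the live site).
SAMPLE_HTML = """
<div id="details">
  <a id="video-title"
     class="yt-simple-endpoint style-scope ytd-grid-video-renderer"
     title="Pike fishing in spring"
     href="/watch?v=abc123">Pike fishing in spring</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: the tag name ("a") plus one distinguishing attribute
# (id="video-title") is enough to locate the element.
video_link = soup.find("a", id="video-title")

# Step 2: the wanted data is either an attribute value or the tag's text.
title_from_attribute = video_link.get("title")
title_from_text = video_link.get_text(strip=True)
href = video_link.get("href")

print(title_from_attribute)
print(href)
```

Note that this only works when the element is present in the HTML that was actually downloaded; on a JavaScript-heavy site the element may exist in the browser inspector but not in the raw response.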

jimsxxl · (This post was last modified: Apr-11-2021, 03:10 PM by jimsxxl.)

Hello snippsat!
I understand what you are saying...
I'm a beginner when it comes to Python, with all its modules and syntax, but I'm not a beginner when it comes to coding in general.

I scraped Betfair one week ago, which I consider a pretty difficult site as well.
I did that with Selenium.

In this project I wanted to try out BeautifulSoup and requests_html.

I surely could scrape YouTube too if I started to google and look at a bunch of examples.
But I want to be able to "see" for myself how I would go about extracting exactly what I want.

I will take a look at your links, thanks a lot!

jimsxxl · Apr-11-2021, 05:25 PM

Thanks a lot for the links, snippsat... they clarified a couple of things for me!
I'm using requests_html instead of requests because I noticed that requests got stuck on the "Agree to continue" page that YouTube has.
In my last Betfair project I fixed that with .click, but I wanted to see if it could be done without Selenium and without loading a browser into the program.

Here is the code so far:

from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

tempfile = "/home/xxx/projects/jims-youtube_scraper/tempvideofile.html"
channels = [
    'https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos'       # KanalGratis
    #'https://www.youtube.com/user/svartzonker/videos'                      # Svartzonker
]

title = []
link = []
count = 0

session = HTMLSession()

for c in channels:
    get_response = session.get(c)
    get_response.html.render(sleep=1)
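    # Save the rendered HTML, then parse it from the saved file.
    # "with" closes each file handle; both use the same encoding.
    with open(tempfile, "w", encoding='utf8') as f:
        f.write(get_response.html.html)
    with open(tempfile, 'r', encoding='utf8') as f:
        soup = bs(f, 'html.parser')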

    #name = soup.find('yt', class_='style-scope ytd-channel-name')
    #print(name.get('text'))

    for t in soup.find_all('a', class_='yt-simple-endpoint style-scope ytd-grid-video-renderer'):
        title.append(t.get('title'))

    for l in soup.find_all('a', class_='yt-simple-endpoint style-scope ytd-grid-video-renderer'):
        link.append(l.get('href'))

while count != len(title):
    print("Title:", title[count], "URL: www.youtube.com" + link[count])

    count = count + 1

Please let me know if I could have done it in a better way, or if something looks funny to you.
I would really appreciate some feedback from experienced Python coders!
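
One possible tidy-up of the parsing part, offered as a sketch rather than the definitive way to do it: collect title and link in a single pass instead of two identical `find_all` loops, and build the full URL with `urljoin`. The HTML in `SAMPLE_HTML` is a hypothetical stand-in for the rendered channel page, and `BASE_URL` and `videos` are names chosen for this example only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.youtube.com"

# Hypothetical stand-in for the rendered channel page (an assumption).
SAMPLE_HTML = """
<a class="yt-simple-endpoint style-scope ytd-grid-video-renderer"
   title="First video" href="/watch?v=aaa">First video</a>
<a class="yt-simple-endpoint style-scope ytd-grid-video-renderer"
   title="Second video" href="/watch?v=bbb">Second video</a>
<a class="yt-simple-endpoint" href="/about">About</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

videos = []
# A CSS selector matches on individual classes, so it does not depend on
# the exact order of the class names in the attribute.
for a in soup.select("a.yt-simple-endpoint.ytd-grid-video-renderer"):
    # Title and link come from the same element, so read both at once.
    videos.append((a.get("title"), urljoin(BASE_URL, a.get("href"))))

for title, url in videos:
    print("Title:", title, "URL:", url)
```

Keeping each title paired with its link in one tuple avoids the separate `count` index and the risk of the two lists getting out of step.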

***snippsat*** · (This post was last modified: Apr-12-2021, 10:51 AM by snippsat.)

If it works with requests_html then it's okay.
I have only tested requests_html briefly (one problem is that its GitHub repo is not updated regularly). You can also use Selenium and load the browser with the --headless option.

requests_html uses pyppeteer, which is headless by default.
Sometimes it is useful to see the browser before going headless, for example to check that a button is pushed or that text is entered into a field;
then Selenium can be the better choice.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#--| Setup
options = Options()
options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos"
browser.get(url)
title = browser.find_elements_by_css_selector('#text-container')[0]
print(title.text)

Output:
kanalgratisdotse

The fastest way is to use the YouTube API.

import requests

channel_id = 'UCwTrHPEglCkDz54iSg9ss9Q'
api_key = 'xxxxxxxxxxxxxxxxxxx'

url = f'https://www.googleapis.com/youtube/v3/channels?id={channel_id}&part=snippet&key={api_key}'
response = requests.get(url).json()
print(response['items'][0]['snippet']['title'])

Output:
kanalgratisdotse
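
The lookup `response['items'][0]['snippet']['title']` raises an error when the API returns no items (for example with a wrong channel id). A slightly more defensive sketch is shown below. The helper names `build_channel_url` and `channel_title` are made up for this example, and `sample_payload` is a hypothetical, trimmed-down response that only mirrors the nesting used in the lookup above; no network request is made here.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/channels"


def build_channel_url(channel_id, api_key):
    """Build the request URL; urlencode takes care of escaping the values."""
    query = urlencode({"id": channel_id, "part": "snippet", "key": api_key})
    return f"{API_ENDPOINT}?{query}"


def channel_title(payload):
    """Return the first channel's title, or None when nothing was returned."""
    items = payload.get("items") or []
    if not items:
        return None
    return items[0].get("snippet", {}).get("title")


# Hypothetical response body with the same nesting as in the post above.
sample_payload = {"items": [{"snippet": {"title": "kanalgratisdotse"}}]}

print(build_channel_url("UCwTrHPEglCkDz54iSg9ss9Q", "xxxxxxxxxxxxxxxxxxx"))
print(channel_title(sample_payload))
print(channel_title({"items": []}))
```

In real use, the dictionary passed to `channel_title` would be the result of `requests.get(url).json()`.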

jimsxxl · (This post was last modified: Apr-12-2021, 03:07 PM by jimsxxl.)

Hi again snippsat!
Yeah, I've tried the --headless option in Selenium.

So basically requests_html is the same as Selenium with the headless option (as far as getting the HTML code goes)?
I thought requests_html was "lighter" than Selenium for some reason, and that's why I chose it.

If I chose to use Selenium this time, would BeautifulSoup be unnecessary then?

I wanted to learn BS4 in this project. Would it be foolish to combine Selenium and BS4?

Thanks a lot for your replies, snippsat!

***snippsat*** · (This post was last modified: Apr-13-2021, 03:33 AM by snippsat.)

(Apr-12-2021, 03:06 PM)jimsxxl Wrote: So basically requests_html is the same as Selenium with the headless option (as far as getting the HTML code goes)?

Resource-wise it will be the same, since requests_html uses pyppeteer (headless) Chrome/Chromium browser automation.

(Apr-12-2021, 03:06 PM)jimsxxl Wrote: If I chose to use Selenium this time, would BeautifulSoup be unnecessary then?

I wanted to learn BS4 in this project. Would it be foolish to combine Selenium and BS4?

It's fine to send browser.page_source to BS4 and then do the parsing with BS4.
Example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

#--| Setup
options = Options()
options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos"
browser.get(url)
# Send to BS
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.select_one('#video-title')
print(title.text)

Output:
WE FISH THE SAME SPOT FOR 12 HOURS - Amazing Results!! | Team Galant
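
`select_one` returns only the first matching element, which is why a single title is printed above. To get every title, `select` returns a list of all matches. The sketch below shows the difference on a made-up fragment; `SAMPLE_HTML` is an assumption standing in for `browser.page_source`, and it uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for browser.page_source (an assumption).
SAMPLE_HTML = """
<a id="video-title" href="/watch?v=aaa">First video</a>
<a id="video-title" href="/watch?v=bbb">Second video</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one: first match only, or None if nothing matches.
first = soup.select_one("#video-title")
if first is not None:
    print(first.get_text(strip=True))

# select: a list with every match, empty if nothing matches.
all_titles = [a.get_text(strip=True) for a in soup.select("#video-title")]
print(all_titles)
```

Checking for `None` before using `.text` avoids an `AttributeError` when the page has not finished rendering and the element is missing.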

jimsxxl · Apr-13-2021, 09:06 AM

Okay, thank you snippsat!
I appreciate all the answers a lot!
My learning journey continues!
