Python Forum
Logic behind BeautifulSoup data-parsing
#1
Hello guys,
I'm messing around a bit with bs4; I'm trying to parse some data from YouTube as a "learning project".
What I'm finding difficult to understand is, when searching for an element to parse (for example the video title),
what should I be looking at? What is the key to getting the video title extracted from the HTML code?

How should I think when I inspect an element in my browser?
What piece of the code am I interested in?

Thank you in advance!
#2
(Apr-11-2021, 01:38 PM)jimsxxl Wrote: I'm messing around a bit with bs4; I'm trying to parse some data from YouTube as a "learning project".
What I'm finding difficult to understand is, when searching for an element to parse (for example the video title)...
YouTube is not an easy site to scrape and not something to start with;
you need Selenium, or an understanding of how a site like that operates.
There is an API for YouTube that can get info like the title.
Look at Web-Scraping part-1, part-2.

Here is a post about API usage.
#3
Hello snippsat!
I understand what you are saying...
I'm a beginner when it comes to Python, with all its modules and syntax. But I'm not a beginner when it comes to coding in general.

I scraped Betfair one week ago, which I consider a pretty difficult site as well.
I did that with Selenium.

In this project I wanted to try out BeautifulSoup and requests_html.

I surely could scrape YouTube too if I started to google and view a bunch of examples.
But I want to be able to "see" it by myself, how I would go about extracting exactly what I want.

I will take a look at your links, thanks a lot!
#4
Thanks a lot snippsat for the links... they clarified a couple of things for me!
I'm using requests_html instead of requests because I noticed that requests got stuck on the "Agree to continue" page YouTube has.
In my last Betfair project I fixed that with .click, but I wanted to see if it could be done without Selenium and loading a browser into the program.
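(A possible alternative, untested here and with the cookie name/value as a pure assumption on my part, is to pre-set a consent cookie on a plain requests session and see whether that skips the consent page:)

```python
import requests

# Hypothetical: pre-set a consent cookie so YouTube (maybe) skips the
# "Agree to continue" page. The cookie name/value are an assumption.
session = requests.Session()
session.cookies.set("CONSENT", "YES+1", domain=".youtube.com")

# resp = session.get("https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos")
```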

Here is the code so far:
from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

tempfile = "/home/xxx/projects/jims-youtube_scraper/tempvideofile.html"
channels = [
    'https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos'       # KanalGratis
    #'https://www.youtube.com/user/svartzonker/videos'                      # Svartzonker
]

titles = []
links = []

session = HTMLSession()

for c in channels:
    get_response = session.get(c)
    get_response.html.render(sleep=1)
    # Save the rendered HTML to disk, then parse it;
    # 'with' ensures the file handles are closed
    with open(tempfile, "w", encoding='utf8') as f:
        f.write(get_response.html.html)
    with open(tempfile, 'r', encoding='utf8') as f:
        soup = bs(f, 'html.parser')

    #name = soup.find('yt', class_='style-scope ytd-channel-name')
    #print(name.get('text'))

    # One pass over the anchors collects both title and href
    for a in soup.find_all('a', class_='yt-simple-endpoint style-scope ytd-grid-video-renderer'):
        titles.append(a.get('title'))
        links.append(a.get('href'))

for title, link in zip(titles, links):
    print("Title:", title, "URL: www.youtube.com" + link)
Please let me know if I could have done it in a better way, or if something looks funny to you.
I would really appreciate some feedback from experienced Python coders!
#5
If it works with requests_html then it's okay.
I have only tested requests_html briefly (a problem is that it's not updated regularly; see the GitHub repo). You can also use Selenium and load the browser with the --headless option.

requests_html uses pyppeteer, which is headless by default.
Sometimes it's useful to see the browser before going headless, e.g. to check whether you push a button or enter text into a field;
then Selenium can be the better choice.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#--| Setup
options = Options()
options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos"
browser.get(url)
title = browser.find_elements_by_css_selector('#text-container')[0]
print(title.text)
Output:
kanalgratisdotse
The fastest way is using the YouTube API.
import requests

channel_id = 'UCwTrHPEglCkDz54iSg9ss9Q'
api_key = 'xxxxxxxxxxxxxxxxxxx'

url = f'https://www.googleapis.com/youtube/v3/channels?id={channel_id}&part=snippet&key={api_key}'
response = requests.get(url).json()
print(response['items'][0]['snippet']['title'])
Output:
kanalgratisdotse
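If you want the video list as well (not just the channel title), the API can do that too. A sketch, untested and with the helper names being mine rather than anything from the API: a channel's uploads live in a special playlist whose id comes from the channels endpoint's contentDetails part, and playlistItems then lists the videos.

```python
API = 'https://www.googleapis.com/youtube/v3'

def channel_url(channel_id, api_key):
    # contentDetails exposes relatedPlaylists['uploads'], the playlist
    # holding every upload on the channel
    return f'{API}/channels?id={channel_id}&part=contentDetails&key={api_key}'

def playlist_items_url(playlist_id, api_key, max_results=5):
    return (f'{API}/playlistItems?playlistId={playlist_id}'
            f'&part=snippet&maxResults={max_results}&key={api_key}')

# Usage sketch (needs `import requests` and a real key):
# data = requests.get(channel_url('UCwTrHPEglCkDz54iSg9ss9Q', api_key)).json()
# uploads = data['items'][0]['contentDetails']['relatedPlaylists']['uploads']
# for item in requests.get(playlist_items_url(uploads, api_key)).json()['items']:
#     print(item['snippet']['title'])
```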
#6

Hi again snippsat!
Yeah, I've tried the --headless option in Selenium.

So basically requests_html is the same as Selenium with the headless option (as far as getting the HTML code)?
I thought requests_html was "lighter" than Selenium for some reason; that's why I chose it.

If I were to choose Selenium this time, would BeautifulSoup be unnecessary then?

I wanted to learn bs4 in this project; would it be foolish to combine Selenium and bs4?

Thanks a lot for your replies snippsat!
#7
(Apr-12-2021, 03:06 PM)jimsxxl Wrote: So basically requests_html is the same as Selenium with the headless option (as far as getting the HTML code)?
Resource-wise it will be the same, as requests_html uses pyppeteer (headless) Chrome/Chromium browser automation.

(Apr-12-2021, 03:06 PM)jimsxxl Wrote: If I were to choose Selenium this time, would BeautifulSoup be unnecessary then?

I wanted to learn bs4 in this project; would it be foolish to combine Selenium and bs4?
It's fine to send browser.page_source to bs4 and then do the parsing with bs4.
Example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

#--| Setup
options = Options()
options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.youtube.com/channel/UCwTrHPEglCkDz54iSg9ss9Q/videos"
browser.get(url)
# Send to BS
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.select_one('#video-title')
print(title.text)
Output:
WE FISH THE SAME SPOT FOR 12 HOURS - Amazing Results!! | Team Galant
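The same idea scales to all the titles on the page: parse browser.page_source with select() instead of select_one(). A self-contained sketch on a tiny stand-in snippet (the real YouTube markup is much more involved, but the selector is the same):

```python
from bs4 import BeautifulSoup

# Stand-in for browser.page_source; YouTube reuses id="video-title"
# on every grid entry, and select() returns all matches.
html = """
<a id="video-title" href="/watch?v=aaa">First video</a>
<a id="video-title" href="/watch?v=bbb">Second video</a>
"""

soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('a#video-title'):
    print(a.text.strip(), '-> https://www.youtube.com' + a['href'])
```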
#8
Okay, thank you snippsat!
I appreciate all the answers a lot!
My learning journey continues!

