Posts: 9
Threads: 3
Joined: Apr 2021
Hello guys,
I'm messing around a bit with bs4, trying to parse some data from YouTube as a "learning project".
What I'm finding difficult to understand is, when searching for an element to parse (for example the video title)...
what should I be looking at? What is the key to getting the video title extracted from the HTML code?
How should I think when I inspect an object in my browser?
What piece of code am I interested in?
Thank you in advance!
Posts: 7,313
Threads: 123
Joined: Sep 2016
Apr-11-2021, 02:54 PM
(Apr-11-2021, 01:38 PM)jimsxxl Wrote: I'm messing around a bit with bs4, trying to parse some data from YouTube as a "learning project". What I'm finding difficult to understand is, when searching for an element to parse (for example the video title)...
YouTube is not an easy site to scrape and not something to start with;
you need Selenium or an understanding of how a site like that operates.
There is an API for YouTube that can get info like the title.
Look at Web-Scraping part-1, part-2.
Here is a post about API usage.
Posts: 9
Threads: 3
Joined: Apr 2021
Apr-11-2021, 03:10 PM
Hello snippsat!
I understand what you are saying...
I'm a beginner when it comes to Python, with all its modules and syntax. But I'm not a beginner when it comes to coding in general.
I scraped Betfair one week ago, which I consider a pretty difficult site as well.
I did that with Selenium.
In this project I wanted to try out BeautifulSoup and requests_html.
I surely could scrape YouTube too if I started to google and look at a bunch of examples.
But I want to be able to "see" it by myself, how I would go about extracting exactly what I want.
I will take a look at your links, thanks a lot!
Posts: 9
Threads: 3
Joined: Apr 2021
Thanks a lot snippsat for the links... they clarified a couple of things for me!
I'm using requests_html instead of requests because I noticed that requests got stuck on YouTube's "Agree to continue" page.
In my last Betfair project I fixed that with .click, but I wanted to see if it could be done without Selenium and loading a browser into the program.
Here is the code so far:
from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

tempfile = "/home/xxx/projects/jims-youtube_scraper/tempvideofile.html"
channels = [
]
title = []
link = []
count = 0
session = HTMLSession()
for c in channels:
    get_response = session.get(c)
    get_response.html.render(sleep=1)
    open(tempfile, "w", encoding='utf8').write(get_response.html.html)
    opentemp = open(tempfile, 'r')
    soup = bs(opentemp, 'html.parser')
    for t in soup.find_all('a', class_='yt-simple-endpoint style-scope ytd-grid-video-renderer'):
        title.append(t.get('title'))
    for l in soup.find_all('a', class_='yt-simple-endpoint style-scope ytd-grid-video-renderer'):
        link.append(l.get('href'))
while count != len(title):
    print("Title:", title[count], "URL: www.youtube.com" + link[count])
    count = count + 1
Please let me know if I could have done it in a better way, or if something looks funny to you.
I would really appreciate some feedback from experienced Python coders!
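On the feedback question, one possible tidy-up is to collect title and link in a single pass and pair them with a for loop instead of a counter. This is only a sketch under assumptions: SAMPLE_HTML is made-up markup standing in for the rendered page, extract_videos and BASE_URL are hypothetical names, and the selector matches on one class only, since find_all with a multi-class string needs the class attribute to match that exact string.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the rendered channel page;
# real YouTube HTML is far larger and its class names can change.
SAMPLE_HTML = """
<div id="items">
  <a class="yt-simple-endpoint style-scope ytd-grid-video-renderer"
     title="First video" href="/watch?v=aaa">First video</a>
  <a class="yt-simple-endpoint style-scope ytd-grid-video-renderer"
     title="Second video" href="/watch?v=bbb">Second video</a>
  <a class="yt-simple-endpoint style-scope ytd-grid-video-renderer">no title or href</a>
</div>
"""

BASE_URL = "https://www.youtube.com"  # assumed base for the relative hrefs


def extract_videos(html):
    """Return (title, absolute_url) pairs for every video link in html."""
    soup = BeautifulSoup(html, "html.parser")
    videos = []
    # The CSS selector matches on a single class, so the result does not
    # depend on the order of the three class names in the attribute.
    for a in soup.select("a.ytd-grid-video-renderer"):
        title = a.get("title")
        href = a.get("href")
        if title and href:  # skip anchors missing either attribute
            videos.append((title, urljoin(BASE_URL, href)))
    return videos


for title, url in extract_videos(SAMPLE_HTML):
    print("Title:", title, "URL:", url)
```

With requests_html, the string passed to extract_videos would be get_response.html.html after render(), so the temp file is not needed for parsing.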
Posts: 7,313
Threads: 123
Joined: Sep 2016
Apr-12-2021, 10:51 AM
If it works with requests_html then it's okay.
I have only tested requests_html briefly (one problem is that its GitHub repo is not updated regularly). You can also use Selenium and load the browser with the --headless option.
requests_html uses pyppeteer, which is headless by default.
Sometimes it is useful to see the browser before going headless, for example to check that a button gets pushed or text gets entered into a field;
then Selenium can be the better choice.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
browser.get(url)
title = browser.find_elements_by_css_selector('#text-container')[0]
print(title.text)
Output: kanalgratisdotse
The fastest way is using the YouTube API.
import requests

channel_id = 'UCwTrHPEglCkDz54iSg9ss9Q'
api_key = 'xxxxxxxxxxxxxxxxxxx'
response = requests.get(url).json()
print(response['items'][0]['snippet']['title'])
Output: kanalgratisdotse
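The url used in the API snippet is not shown in the post. As a rough sketch only, and assuming the request goes to the channels.list method of the YouTube Data API v3 with part=snippet, the URL could be built from channel_id and api_key as below. The helper names build_channel_url and channel_title are hypothetical, and the sample JSON is a made-up, trimmed response containing only the fields the post reads.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the channels.list method of the YouTube Data API v3.
API_ENDPOINT = "https://www.googleapis.com/youtube/v3/channels"


def build_channel_url(channel_id, api_key):
    """Build a channels.list request URL asking for the snippet part."""
    params = {"part": "snippet", "id": channel_id, "key": api_key}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def channel_title(payload):
    """Return the first channel title in a decoded response, or None."""
    items = payload.get("items") or []
    return items[0]["snippet"]["title"] if items else None


url = build_channel_url("UCwTrHPEglCkDz54iSg9ss9Q", "xxxxxxxxxxxxxxxxxxx")
print(url)

# Hypothetical, trimmed response body with only the fields read above.
sample = json.loads('{"items": [{"snippet": {"title": "kanalgratisdotse"}}]}')
print(channel_title(sample))
```

Guarding against an empty items list avoids an IndexError when the channel ID is wrong or the key is rejected.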
Posts: 9
Threads: 3
Joined: Apr 2021
Apr-12-2021, 03:06 PM
Hi again snippsat!
Yeah, I've tried the --headless option in Selenium.
So basically requests_html is the same as Selenium with the headless option (as far as getting the HTML code)?
I thought requests_html was "lighter" than Selenium for some reason, that's why I chose it.
If I chose to use Selenium this time, would BeautifulSoup be unnecessary then?
I wanted to learn Bs4 in this project, would it be foolish to combine Selenium and BS4?
Thanks a lot for your replies snippsat!
Posts: 7,313
Threads: 123
Joined: Sep 2016
Apr-13-2021, 03:33 AM
(Apr-12-2021, 03:06 PM)jimsxxl Wrote: So basically requests_html is the same as Selenium with the headless option (as far as getting the HTML code)?
Resource-wise it will be the same, as requests_html uses pyppeteer for headless Chrome/Chromium browser automation.
(Apr-12-2021, 03:06 PM)jimsxxl Wrote: If I chose to use Selenium this time, would BeautifulSoup be unnecessary then? I wanted to learn Bs4 in this project, would it be foolish to combine Selenium and BS4?
It's fine to send browser.page_source to Bs4 and then do the parsing with Bs4.
Example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.select_one('#video-title')
print(title.text)
Output: WE FISH THE SAME SPOT FOR 12 HOURS - Amazing Results!! | Team Galant
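One thing to watch with this pattern: select_one returns None when nothing matches (for example if the page had not finished rendering), and title.text would then raise an AttributeError. A small sketch of keeping the parsing step separate and guarded is below; first_video_title is a hypothetical helper, the HTML strings are made-up stand-ins for browser.page_source, and html.parser is used here only so the sketch does not need lxml installed.

```python
from bs4 import BeautifulSoup


def first_video_title(page_source):
    """Return the text of the first #video-title element, or None."""
    soup = BeautifulSoup(page_source, "html.parser")
    element = soup.select_one("#video-title")
    if element is None:  # nothing matched, e.g. page not rendered yet
        return None
    return element.get_text(strip=True)


# Made-up stand-ins for browser.page_source
rendered = '<a id="video-title" href="/watch?v=abc">  Example title </a>'
not_rendered = "<html><body></body></html>"

print(first_video_title(rendered))
print(first_video_title(not_rendered))
```

Because the function only takes a string, the same parsing code works whether the HTML came from Selenium or from requests_html.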
Posts: 9
Threads: 3
Joined: Apr 2021
Okay, thank you snippsat!
I appreciate all the answers a lot!
My learning journey continues!