scraping video src fail - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: General Coding Help (https://python-forum.io/forum-8.html)
Thread: scraping video src fail (/thread-34242.html)
scraping video src fail - jacklee26 - Jul-10-2021

Hi, I have a question related to reading HTML. I am trying to use Selenium to scrape a video, but it does not seem to work; I just want to grab the src link. I can find it with the Chrome developer tools, but when I use an XPath in Selenium it raises an error. It seems related to https://stackoverflow.com/questions/31298016/htmlreader-having-difficulty-scraping-video-links.

When I use driver.find_element_by_xpath it raises an error. Below is the HTML (as shown in the developer tools):

<html>
<head>
<body>
<div>
  <html>
  <head>
  <body>
    <div id="container">
      <div tableindex= -1>
        <video id="my_video_1_html5_api" class="vjs-tech" controlslist="nodownload" preload="auto" poster="https://vs02.520call.me/files/mp4/f/ttttt.jpg"
        <source src="https://vs02.520call.me/files/mp4/f/fwnOT.m3u8?t=1234444” type="application/vnd.apple.mpegurl"></video>

I tried these three ways, but none of them was able to grab the src link:

driver.find_element_by_xpath('//*[@id="my_video_1_html5_api"]')
driver.find_element_by_xpath('//*[@class="vjs-tech"]')
driver.find_element_by_id('lmy_video_1_html5_api')

It raises this error:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="lmy_video_1_html5_api"]"}
  (Session info: chrome=91.0.4472.114)

RE: scraping video src fail - snippsat - Jul-10-2021

jacklee26 Wrote: {"method":"css selector","selector":"[id="lmy_video_1_html5_api"]"}

Spelling: l my_video_1_html5_api, there is an extra l in front of the id.

XPath:
//*[@id="my_video_1_html5_api"]
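As a minimal sketch of how the corrected id could be used (assuming a driver is already available and the page with the posted HTML is loaded; the URL below is only a placeholder, not the real page): the src sits on the <source> child of the <video> tag, so the XPath can reach one level deeper and get_attribute can read it.

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://example.com/page-with-video')  # placeholder URL for illustration

# Corrected id (no leading "l"), then step into the <source> child to read its src
source_tag = driver.find_element_by_xpath('//*[@id="my_video_1_html5_api"]/source')
print(source_tag.get_attribute('src'))

This is only an illustration of the locator; the actual page may need extra handling, as the rest of the thread shows.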
RE: scraping video src fail - jacklee26 - Jul-10-2021

(Jul-10-2021, 02:16 PM)snippsat Wrote: Spelling: l my_video_1_html5_api

driver.find_element_by_xpath("//*[@id='my_video_1_html5_api']")

Still not working, it raises:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='my_video_1_html5_api']"}
  (Session info: chrome=91.0.4472.114)

RE: scraping video src fail - snippsat - Jul-10-2021

I can do a local test with the HTML code you have posted.

local3.html:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <title>Test site</title>
</head>
<body>
  <p id='foo'>hello world</p>
  <div id="container">
    <div tableindex= -1>
      <video id="my_video_1_html5_api" class="vjs-tech" controlslist="nodownload" preload="auto" poster="https://vs02.520call.me/files/mp4/f/ttttt.jpg"
      <source src="https://vs02.520call.me/files/mp4/f/fwnOT.m3u8?t=1234444” type="application/vnd.apple.mpegurl"></video>
    </div>
  </div>
</body>
</html>

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#--| Setup
options = Options()
#options.add_argument("--headless")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
browser.get('file:///E:/div_code/scrape/local3.html')
p_tag = browser.find_elements_by_id('foo')
print(p_tag[0].text)
video_tag = browser.find_elements_by_xpath('//*[@id="my_video_1_html5_api"]')
print(video_tag)

So it works. The video tag will have no text, so you have to use get_attribute to get info out.

>>> video_tag
[<selenium.webdriver.remote.webelement.WebElement (session="1a81b871d83bdeb1e947cc8f1a074be6", element="cdf7afaf-c573-4f49-aee8-66bd8a4d3b40")>]
>>> video_tag[0].text
''
>>> video_tag[0].get_attribute('src')
'https://vs02.520call.me/files/mp4/f/fwnOT.m3u8?t=1234444%E2%80%9D%20%20type='
>>> video_tag[0].get_attribute('poster')
'https://vs02.520call.me/files/mp4/f/ttttt.jpg'

RE: scraping video src fail - jacklee26 - Jul-11-2021

What if the HTML page has a double html like this? When I try to get it, it is empty.

<head/>
<body data-rsssl="1">
  <div class="wrapper section medium-padding">
    <div class="section-inner">
      <div class="content fleft">
        <div id="post-1111">
          <div class="post-content">
            <div style="text-align:center;">
              <div style="text-align:center;" class="player_wrapper">
                <iframe id="allmyplayer" name="allmyplayer">
                  #document
                  <html>
                  <head>www </head>
                  <body>
                    <div id="container">
                      <div tableindex= -1>
                        <video id="my_video_1_html5_api" class="vjs-tech" controlslist="nodownload" preload="auto" poster="https://vs02.520call.me/files/mp4/1/13cDq.jpg" data-setup="{"example_option":true, "inactivityTimeout": 0}" tabindex="-1" src="blob:https://video.520call.me/bce3f239-3627-4893-a458-5605f3eb978b"><source src="https://vs02.520call.me/files/mp4/1/13cDq.m3u8?t=1625961526" type="application/vnd.apple.mpegurl"></video>
                      </div>
                    </div>
RE: scraping video src fail - snippsat - Jul-11-2021

(Jul-11-2021, 02:32 AM)jacklee26 Wrote: what if the HTML page has double html like this, I went to get but it is empty

Adding /source to the XPath will break it. There is a tag that makes this task different, which is iframe. Also, a common mistake is not giving the page time to load; I use time.sleep as a first test, and there are Waits that deal with this properly. The code below tests against the HTML you posted; on the real page you may need to switch into the frame with browser.switch_to.frame(iframe). I get the text from the iframe element, then parse that text (it is now just text, not HTML in the live page) with BS to get the tag wanted.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup

#--| Setup
options = Options()
#options.add_argument("--headless")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
browser.get('file:///E:/div_code/scrape/local4.html')
sleep(3)
video_tag = browser.find_elements_by_xpath('//*[@id="allmyplayer"]')
print(video_tag)
# Send text html to BS for parse
soup = BeautifulSoup(video_tag[0].text, 'lxml')
print(soup.find('source').get('src', 'Not Found'))
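For the real page, the switch into the iframe mentioned above can be combined with an explicit Wait instead of time.sleep. Below is a minimal sketch of that approach, under the assumption that the iframe id (allmyplayer) and the video id inside it match the HTML posted earlier in the thread; the URL is a placeholder, not the actual site.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()  # assumes chromedriver is on PATH
browser.get('https://example.com/page-with-player')  # placeholder URL for illustration

wait = WebDriverWait(browser, 10)
# Wait until the iframe exists, then switch the driver context into it
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'allmyplayer')))
# Inside the frame the <video>/<source> tags become reachable with a normal locator
source_tag = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="my_video_1_html5_api"]/source'))
)
print(source_tag.get_attribute('src'))
browser.switch_to.default_content()  # switch back out of the iframe when done

The explicit wait avoids the fixed 3-second sleep and only blocks as long as needed, up to the timeout given to WebDriverWait.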