Python Forum
Challenging BS4 Problem
#1
Hey guys, I'm trying to improve my scraping abilities, so I've been practicing on some tough-to-obtain data. I've run into issues with the following page:

http://www.foxnews.com/tech/2018/01/11/h...tline.html

Elements to scrape:

  1. Article Time Published [data-time-published]
  2. Message Count [param-0 param-messagesCount]
  3. Message Time Stamp [message-timestamp]

The 'time published' is being updated by AJAX or JS, or some other kind of sorcery. Everything in the comments section appears to be loaded the same way.

So my question is: how would you guys approach this problem? I'm currently using requests. Should I load the page with Selenium instead? Would that make it easier to scrape?

Any advice is greatly appreciated.
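
A minimal sketch of one thing worth checking first: sometimes the raw value sits in an attribute of the static HTML even when the visible text is filled in by JS. The snippet below collects every value of a given attribute using only the standard library. The sample markup and its timestamp are made up for illustration; the real page may be structured differently, and the attribute may not be present in the raw response at all.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of one attribute from every tag that carries it."""

    def __init__(self, attribute_name):
        super().__init__()
        self.attribute_name = attribute_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        for name, value in attrs:
            if name == self.attribute_name and value is not None:
                self.values.append(value)


def collect_attribute(html, attribute_name):
    """Return all values of attribute_name found in the given HTML string."""
    collector = AttributeCollector(attribute_name)
    collector.feed(html)
    collector.close()
    return collector.values


# Hypothetical markup and timestamp, for illustration only.
SAMPLE_HTML = (
    '<div class="article-date">'
    '<time data-time-published="2018-01-11T12:00:00-05:00">4 days ago</time>'
    '</div>'
)

published = collect_attribute(SAMPLE_HTML, "data-time-published")
print(published)
```

If the list comes back empty for the HTML that requests returns, the value is most likely added by JS after load, and a browser-based approach or the underlying data source is needed.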

#2
Using a browser (e.g. with selenium) is one of the options you have.

Another is figuring out where the data comes from and grabbing it yourself (the comments come from a JSON API, so you wouldn't have any problem parsing them).
This would probably result in simpler, more efficient code, but it might take some digging around to find the data.
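
As a rough sketch of what the JSON route could look like once the endpoint has been found: the payload below is invented, and the field names (`messagesCount`, `messages`, `timestamp`) are placeholders, not the real API's schema. The point is only that a JSON response reduces to plain dictionary and list access.

```python
import json
from datetime import datetime, timezone

# Made-up payload: the field names are placeholders, not the real API's.
SAMPLE_PAYLOAD = """
{
  "messagesCount": 2,
  "messages": [
    {"text": "First comment", "timestamp": 1516000000},
    {"text": "Second comment", "timestamp": 1516003600}
  ]
}
"""


def summarize_comments(raw_json):
    """Return (message count, list of ISO 8601 UTC timestamps) from a JSON string."""
    data = json.loads(raw_json)
    timestamps = [
        # Assumes Unix timestamps in seconds; adjust if the real data differs.
        datetime.fromtimestamp(message["timestamp"], tz=timezone.utc).isoformat()
        for message in data.get("messages", [])
    ]
    return data.get("messagesCount"), timestamps


count, stamps = summarize_comments(SAMPLE_PAYLOAD)
print(count, stamps)
```

With requests, the same function could be fed `response.text` from whatever URL the browser's network tab shows the comments being loaded from.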
#3
I managed to scrape the article post date using some PyQt5 code I found:

import bs4 as bs
import sys
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

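class Page(QWebEnginePage):
    """Load a URL in Qt WebEngine and keep the rendered HTML in self.html."""

    def __init__(self, url):
        # A QApplication must exist before the web engine page is created.
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # blocks until Callable() calls quit()

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns None; the HTML arrives in the callback.
        self.toHtml(self.Callable)

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()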


def main():
    page = Page('http://www.foxnews.com/tech/2018/01/11/how-deadly-drone-swarms-will-help-us-troops-on-frontline.html')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('div', class_='article-date')

    print(js_test.text)

if __name__ == '__main__': main()
If I print the soup (results here: https://pastebin.com/f9na0nNc), the comments section isn't in it. The page seems to have some kind of JS detection that only loads the comments once the user scrolls down.

<script type="text/x-spotim-options">SPOTIM_OPTIONS = {"prerenderDeferred":true}</script><script src="https://app-cdn.spot.im/modules/prerender/3.1.385/conversation/host/host-bundle.js"></script><script data-ready="true" data-spotim-module="essential" src="https://app-cdn.spot.im/modules/essential/3.1.385/bundle.js" type="text/javascript"></script><script type="text/javascript">if (!window['SPOTIM_AUTH0_ENABLED']) {
        (function () {
            //VERSION 3
            var intervalId = -1;
            var CHECKER_INTERVAL = 1000;
            var JANRAIN_CAPTURE_TOKEN_KEY = 'janrainCaptureToken';
            var captureToken = '';
 
            function startJanrainListener() {
                if (typeof window.localStorage !== 'undefined') {
                    tokenChecker();
                }
            }
 
            function tokenChecker() {
                intervalId = window.setInterval(function () {
                    var captureTokenResult = localStorage.getItem(JANRAIN_CAPTURE_TOKEN_KEY);
                    if (captureTokenResult !== captureToken) {
                        onTokenChanged(captureTokenResult, captureToken);
                        captureToken = captureTokenResult;
                    }
                }, CHECKER_INTERVAL)
            }
 
 
            function onTokenChanged(newToken, oldToken) {
                //Don't remove listener due to multiple login possibility
                // window.clearInterval(intervalId);
 
                if (newToken) {
                    console.log('Janrain user detected');
                    if (window.SPOTIM && window.SPOTIM.getConversations && window.SPOTIM.getConversations().length > 0) {
                        window.SPOTIM.startSSOForProvider({provider: 'janrain', token: newToken});
                    } else {
                        document.addEventListener('spot-im-conversation-loaded', function () {
                            if (!!localStorage.getItem(JANRAIN_CAPTURE_TOKEN_KEY)) {
                                window.SPOTIM.startSSOForProvider({provider: 'janrain', token: newToken});
                            }
                        }, false)
                    }
 
                } else {
                    if (oldToken) {
                        console.log('Janrain user logged out');
                        if (window.SPOTIM && window.SPOTIM.getConversations && window.SPOTIM.getConversations().length > 0) {
                            window.SPOTIM.logout({provider: 'janrain', token: newToken});
                        } else {
                            document.addEventListener('spot-im-conversation-loaded', function () {
                                if (!localStorage.getItem(JANRAIN_CAPTURE_TOKEN_KEY)) {
                                    window.SPOTIM.logout({provider: 'janrain', token: newToken});
                                }
                            }, false)
                        }
                    }
                    else {
                        console.log('No janrain user detected');
                        if (window.SPOTIM && window.SPOTIM.logout) {
                            window.SPOTIM.logout({provider: 'janrain', token: null});
                        } else {
                            document.addEventListener('spot-im-conversation-loaded', function () {
                                if (!localStorage.getItem(JANRAIN_CAPTURE_TOKEN_KEY)) {
                                    window.SPOTIM.logout({provider: 'janrain', token: null});
                                }
                            }, false)
                        }
                    }
                }
            }
 
            startJanrainListener();
 
        })();
    }</script>
Can I force "prerenderDeferred":false? Would that work?

Is there any other way I can get the comments to load without resorting to using Selenium?
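
For inspecting what the page is configured with, here is a minimal sketch that pulls the `SPOTIM_OPTIONS` object out of the HTML with a regular expression and parses it as JSON. The function name and the regex are my own; the snippet it runs on is copied from the page source above. It only reads the option. It says nothing about whether changing the value would make the comments render.

```python
import json
import re

# Snippet copied from the page source shown above.
PAGE_SNIPPET = (
    '<script type="text/x-spotim-options">'
    'SPOTIM_OPTIONS = {"prerenderDeferred":true}</script>'
)

# Capture the object literal assigned to SPOTIM_OPTIONS inside the options script tag.
OPTIONS_PATTERN = re.compile(
    r'<script type="text/x-spotim-options">\s*SPOTIM_OPTIONS\s*=\s*(\{.*?\})\s*</script>',
    re.DOTALL,
)


def read_spotim_options(html):
    """Return the SPOTIM_OPTIONS object as a dict, or None if it is not found."""
    match = OPTIONS_PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


options = read_spotim_options(PAGE_SNIPPET)
print(options)
```

This assumes the options are written as valid JSON, as they are in the snippet above; a JavaScript object literal with unquoted keys would need a different parser.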
#4
(Jan-16-2018, 01:18 PM)digitalmatic7 Wrote: Is there any other way I can get the comments to load without resorting to using Selenium?
Yes, but it can be very difficult, because you have to understand JavaScript and reverse-engineer what the page is doing.
Using Selenium is the best way, rather than the Qt WebEngine code you found.

Both Chrome and Firefox are developing headless drivers for use with Selenium.
PhantomJS also still works as a headless browser for Selenium.
from selenium import webdriver
import time

browser = webdriver.Chrome()
#browser = webdriver.PhantomJS()
url = 'http://www.foxnews.com/tech/2018/01/11/how-deadly-drone-swarms-will-help-us-troops-on-frontline.html'
browser.get(url)
time_ago = browser.find_element_by_xpath('//*[@id="wrapper"]/div/div[2]/div/main/article/header/div[1]/div[2]/time')
print(time_ago.text)  # --> 4 days ago
Headless Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
#-- Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--log-level=3')
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'E:\1py_div\div_code\scrape\chromedriver.exe')
#-- Parse
url = 'http://www.foxnews.com/tech/2018/01/11/how-deadly-drone-swarms-will-help-us-troops-on-frontline.html'
browser.get(url)
time_ago = browser.find_element_by_xpath('//*[@id="wrapper"]/div/div[2]/div/main/article/header/div[1]/div[2]/time')
print(time_ago.text)  # --> 4 days ago
browser.quit()
The discussion is loaded in an iframe, so you need to switch into it with driver.switch_to.frame().
Or, as I showed in another post of yours, you can take the JSON out of the discussion directly.
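
A minimal sketch of the iframe switch, written as a helper that takes an existing Selenium driver. The helper name and the `src_fragment` default are my own choices; the XPath assumes the comments iframe can be recognized by `spot.im` appearing in its `src`, which matches the script URLs on this page but is not guaranteed.

```python
def switch_to_comments_frame(driver, src_fragment="spot.im"):
    """Switch the driver into the first iframe whose src contains src_fragment.

    Returns the iframe element that was switched into.
    """
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    xpath = '//iframe[contains(@src, "{}")]'.format(src_fragment)
    frame = driver.find_element("xpath", xpath)
    driver.switch_to.frame(frame)
    return frame


# Usage, assuming `browser` is a Selenium WebDriver that has already loaded the page:
#     switch_to_comments_frame(browser)
#     html_inside_frame = browser.page_source
#     browser.switch_to.default_content()  # back to the main document
```

Since the comments seem to load only after scrolling, the iframe may not exist right after `browser.get(url)`; scrolling the page or waiting before calling the helper may be necessary.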
#5
(Jan-16-2018, 02:12 PM)snippsat Wrote: The discussion is loaded in an iframe, so you need to switch into it with driver.switch_to.frame().
Or, as I showed in another post of yours, you can take the JSON out of the discussion directly.

I can't believe I forgot to check whether it was in an iframe. Thanks for the reminder! All the data is in there.

I have to keep working at it so that one day I can reverse-engineer everything. That's my long-term goal.
#6
Unfortunately it's not so easy.

Here's my code attempting to get into the right frame for the comments section:

import bs4 as bs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # allows focus on element
from time import sleep

driver = webdriver.Chrome("C:\\Program Files (X86)\\Google\\chromedriver.exe")

driver.get('http://www.foxnews.com/tech/2018/01/11/how-deadly-drone-swarms-will-help-us-troops-on-frontline.html')

soup = bs.BeautifulSoup(driver.page_source, 'html.parser')
js_test1 = soup.find('div', class_='article-date')

for handle in driver.window_handles: # the comments iframe is located in here
    driver.switch_to.window(handle)

# after getting into the right window I tried switching to the frame but I had no luck

soup = bs.BeautifulSoup(driver.page_source, 'html.parser')
print(soup)
Here's some of what I tried, with the iframe markup below:

#driver.switch_to.frame(driver.find_element_by_partial_link_text("spoxy-shard4.spot.im"))
#driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
#driver.switch_to.frame(driver.find_elements_by_xpath('//*[@src="spoxy-shard4.spot.im"]'))

#<div class="sppre_frame-container" id="57d17ca964f60f312990415c235007c8">
#<iframe id="57d17ca964f60f312990415c235007c8-iframe" scrolling="no" src="https://spoxy-shard4.spot.im/v2/spot/sp_ANQXRpqH/post/2817a523-be07-42af-a113-411fcc5f1ace/?elementId=b6391be5edb91e7cd24ca7d13fb36216&amp;spot_im_platform=desktop&amp;host_url=http%3A%2F%2Fwww.foxnews.com%2Ftech%2F2018%2F01%2F11%2Fhow-deadly-drone-swarms-will-help-us-troops-on-frontline.html&amp;host_url_64=aHR0cDovL3d3dy5mb3huZXdzLmNvbS90ZWNoLzIwMTgvMDEvMTEvaG93LWRlYWRseS1kcm9uZS1zd2FybXMtd2lsbC1oZWxwLXVzLXRyb29wcy1vbi1mcm9udGxpbmUuaHRtbA%3D%3D&amp;spot_im_ph__prerender_deferred=true&amp;prerenderDeferred=true&amp;sort_by=newest&amp;spot_im_ih__livefyre_url=2817a523-be07-42af-a113-411fcc5f1ace&amp;isStarsRatingEnabled=false&amp;enableMessageShare=true&amp;enableAnonymize=true&amp;isConversationLiveBlog=false&amp;enableSeeMoreButton=true" style="overflow: hidden;">
#</iframe>
#</div>
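
A few notes on the attempts above. `window_handles` lists browser windows and tabs, not frames. `find_element_by_partial_link_text` matches the text of `<a>` links, not an iframe's `src`. `find_elements_by_xpath` returns a list, while `switch_to.frame` expects a single element. `@src="spoxy-shard4.spot.im"` compares against the whole attribute value, so `contains(@src, ...)` is the usual form. Switching to the first `iframe` by tag name may also land in a different frame than the comments one.

Separately, the iframe's `src` is an ordinary URL, so it can be read out of `driver.page_source` and inspected without switching frames at all. The sketch below does that with the standard library only. The function name and the shortened sample markup are mine; the `host_url_64` value in the sample is just the first part of the one in the real iframe.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class IframeSrcFinder(HTMLParser):
    """Collect the src of every <iframe> in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def describe_comment_frame(html, host_fragment="spot.im"):
    """Return host, path and query parameters of the first matching iframe, or None."""
    finder = IframeSrcFinder()
    finder.feed(html)  # HTMLParser unescapes &amp; in attribute values
    finder.close()
    for src in finder.sources:
        parts = urlsplit(src)
        if host_fragment in parts.netloc:
            query = {key: values[0] for key, values in parse_qs(parts.query).items()}
            encoded = query.get("host_url_64")
            if encoded:
                # host_url_64 is the host page URL, base64-encoded.
                query["host_url_64_decoded"] = base64.b64decode(encoded).decode("utf-8")
            return {"host": parts.netloc, "path": parts.path, "query": query}
    return None


# Shortened, partly made-up version of the iframe above, for illustration only.
SAMPLE_HTML = (
    '<div class="sppre_frame-container">'
    '<iframe src="https://spoxy-shard4.spot.im/v2/spot/sp_ANQXRpqH/post/example/'
    '?spot_im_platform=desktop&amp;host_url_64=aHR0cDovL3d3dy5mb3huZXdzLmNvbS90ZWNo'
    '&amp;sort_by=newest"></iframe></div>'
)

info = describe_comment_frame(SAMPLE_HTML)
print(info)
```

Whether requesting that URL directly with requests returns the comments is a separate question; the frame's own content may also be rendered by JS.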