Python Forum
Webscraping a site that uses javascript?
#1
I'm trying to scrape the stock tickers from the chart embedded on the right of this page (under the graph): http://investsnips.com/list-of-publicly-...companies/
When inspecting the HTML, the stock symbols appear to be embedded in the title attribute ("NASDAQ:ADMA"), with the markup below representing one symbol:
<td class="symbol-short-name-container" title="NASDAQ:ADMA" style="cursor: 
pointer;"><a href="https://www.tradingview.com/chart/?symbol=NASDAQ%3AADMA" 
target="_blank">ADMA Biologics</a></td>
However, I'm failing to capture this code via find_all.
import bs4 as bs
import urllib.request

import re

source = urllib.request.urlopen('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/').read()
soup = bs.BeautifulSoup(source,'lxml')

body = soup.body  # it seems to be under body
After which
body.find_all('tr',  class_="ticker quote-ticker-inited")
[]  # empty list

body.find_all('td',  class_="symbol-short-name-container")
[] #empty list
So it seems that the site uses JavaScript. I have been scouring a web-scraping book (an old one) and the net, but I can't figure out what I'm supposed to do.
Do I need a different module?
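For reference, the selectors themselves are fine once the JavaScript-built markup is actually present - a quick check of the same find_all against the <td> snippet from the inspector above:

```python
import bs4 as bs

# The <td> as the browser's inspector shows it, i.e. *after* JavaScript has run
rendered = '''<table><tr class="ticker quote-ticker-inited">
<td class="symbol-short-name-container" title="NASDAQ:ADMA" style="cursor: pointer;">
<a href="https://www.tradingview.com/chart/?symbol=NASDAQ%3AADMA" target="_blank">ADMA Biologics</a>
</td></tr></table>'''

soup = bs.BeautifulSoup(rendered, 'html.parser')
cells = soup.find_all('td', class_='symbol-short-name-container')
print([td['title'] for td in cells])  # ['NASDAQ:ADMA']
```

So the empty lists mean the tags are simply absent from what urllib downloads, not that the selectors are wrong.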


Thank you.
Reply
#2
(Apr-29-2017, 08:03 PM)bigmit37 Wrote: So it seems that the site uses JavaScript. I have been scouring a web-scraping book (an old one) and the net, but I can't figure out what I'm supposed to do.
Do I need a different module?
Yes. If the site uses JavaScript, the content is generated on the fly, so the data doesn't exist in what plain urllib.request downloads. You need to automate a browser with Selenium. If the HTML your browser gets differs from the HTML Python gets, that's the sign you need Selenium. You can use PhantomJS to "hide" the browser in the background, so to speak. Since you're just parsing the content, all you really need to do is fetch the HTML via Selenium instead, then pass it to BeautifulSoup.
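A minimal sketch of that hand-off, assuming Firefox and its driver are installed (swap in webdriver.PhantomJS() to keep the browser out of sight):

```python
from selenium import webdriver
import bs4 as bs

driver = webdriver.Firefox()  # or webdriver.PhantomJS() for a windowless browser
driver.get('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')

# page_source holds the DOM *after* JavaScript has run,
# unlike the raw bytes urllib.request downloads
soup = bs.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

rows = soup.find_all('tr', class_='ticker quote-ticker-inited')
```

From here the parsing is exactly the same BeautifulSoup code as before, only fed rendered HTML.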
Reply
#3
Thanks. I will look into Selenium now and report back here when I get stuck or accomplish my task.

Actually, I noticed something which is confusing me. When I use soup.find_all('p')[-2], the list of tickers seems to be embedded in the returned output:

<p><script src="https://d33t3vvu2t2yu5.cloudfront.net/tv.js" type="text/javascript"></script><br/>
<script type="text/javascript">
new TradingView.MiniWidget({
  "container_id": "tv-miniwidget-c316c",
  "tabs": [
    "Micro Cap Biotech"
  ],
  "symbols": {
    "Micro Cap Biotech": [
      [
        "Abeona Thera",
        "NASDAQ:ABEO|3m"
      ],
      [
        "Actinium Pharma",
        "AMEX:ATNM|3m"
      ],
      [
        "ADMA Biologics",
        "NASDAQ:ADMA|3m"
      ],
      [
        "Adverum Biotech",
        "NASDAQ:ADVM|3m"
      ],
      [
        "Aeglea",
        "NASDAQ:AGLE|3m"
      ],
      [
        "Affimed",
        "NASDAQ:AFMD|3m"
      ],
      [
        "Akari Therapeutics",
        "NASDAQ:AKTX|3m"
      ],
      [
        "Alcobra",
        "NASDAQ:ADHD|3m"
      ],
Does this mean it's still scrapable with BeautifulSoup?
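Since those name/symbol pairs sit as plain text inside the <script> tag, one option is to pull them out of the static HTML with a regular expression - a sketch, assuming the widget config keeps this shape:

```python
import re

# Excerpt of the inline TradingView config as it appears in the raw HTML
script_text = '''
new TradingView.MiniWidget({
  "symbols": {
    "Micro Cap Biotech": [
      [
        "Abeona Thera",
        "NASDAQ:ABEO|3m"
      ],
      [
        "ADMA Biologics",
        "NASDAQ:ADMA|3m"
      ]
    ]
  }
})
'''

# Each entry is a ["Company Name", "EXCHANGE:TICKER|range"] pair
pairs = re.findall(r'"([^"]+)",\s*"([A-Z]+:[A-Z]+)\|', script_text)
tickers = [symbol.split(':')[1] for _, symbol in pairs]
print(tickers)  # ['ABEO', 'ADMA']
```

In practice script_text would come from the soup.find_all('p')[-2] result above rather than a hard-coded string.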

Thank you.
Reply
#4
Also look at the tutorials by snippsat: https://python-forum.io/Thread-Web-Scraping-part-1
and https://python-forum.io/Thread-Web-scraping-part-2 (this one has examples with selenium)
Reply
#5
(Apr-29-2017, 08:52 PM)Larz60+ Wrote: Also look at the tutorials by snippsat: https://python-forum.io/Thread-Web-Scraping-part-1
and https://python-forum.io/Thread-Web-scraping-part-2 (this one has examples with selenium)

Nice, I wasn't aware he had a Selenium explanation in his tuts.
Reply
#6
Okay, I'm using Selenium but I seem to be stuck.



I can't seem to switch to the frame I want.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')
#driver.find_element_by_xpath('//*[@id="tradingview_4e896"]')
driver.switch_to.frame("tradingview_4e896")
I've tried both the commented line and the switch_to.frame call, and I'm ending up with a NoSuchElement/NoSuchFrame error:

NoSuchElementException: Message: Unable to locate element: //*[@id="tradingview_4e896"]


This is the frame I'm trying to connect to:


Not sure what I'm doing wrong.
Reply
#7
I can't see which id you are trying to switch to.
Try:
time.sleep(4) # let site load 
driver.switch_to.frame(tradingview_03881)
# Or
driver.switch_to.frame(tradingview_0e8ff)
Reply
#8
(Apr-30-2017, 07:37 PM)snippsat Wrote: I can't see which id you are trying to switch to.
Try:
time.sleep(4) # let site load 
driver.switch_to.frame(tradingview_03881)
# Or
driver.switch_to.frame(tradingview_0e8ff)




Can't seem to connect to those either.

NoSuchFrameException: Message: tradingview_0e8ff

I even have time.sleep(20).


driver.switch_to.frame('tradingview_03881')  # I added quotes around them in my code, as they are strings.


It seems the id changes each time we load the page. I wonder if I can use a regex to deal with that.
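A sketch of that regex idea, assuming the id is always "tradingview_" followed by hex digits: scrape the current id out of driver.page_source, then switch by that freshly-read name (the iframe fragment below is illustrative, not copied from the live page):

```python
import re

# A fragment like the one in the live page; the hex suffix changes per load
html = '<iframe id="tradingview_4e896" src="https://s.tradingview.com/"></iframe>'

match = re.search(r'id="(tradingview_[0-9a-f]+)"', html)
frame_id = match.group(1)
print(frame_id)  # tradingview_4e896

# With a live driver, html would be driver.page_source, and then:
# driver.switch_to.frame(frame_id)
```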
Reply
#9
I see that the id name changes on reload.
tradingview is the same every time.
You can try:
driver.switch_to_frame(driver.find_element_by_partial_link_text("tradingview"))
Reply
#10
(Apr-30-2017, 08:53 PM)snippsat Wrote: I see that the id name changes on reload.
tradingview is the same every time.
You can try:
driver.switch_to_frame(driver.find_element_by_partial_link_text("tradingview"))



Okay, I actually got to the next step using the XPath I copied via Firebug, and I was able to find all the elements I was interested in.


driver.find_element_by_xpath('/html/body/div[1]/div/div/article/div/div[2]/div/div[2]/div[1]/div/div/div[1]/iframe')
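For completeness, switch_to.frame also accepts the WebElement itself, so the element found by that XPath can be handed straight to it - a sketch, assuming a live driver on that page:

```python
frame = driver.find_element_by_xpath(
    '/html/body/div[1]/div/div/article/div/div[2]/div/div[2]/div[1]/div/div/div[1]/iframe')
driver.switch_to.frame(frame)  # switch_to.frame takes a WebElement, not just a name/id

# Once inside the frame, the rendered table is visible via page_source:
# soup = bs.BeautifulSoup(driver.page_source, 'lxml')
```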



Thank you so much for your help.
Reply

