Python Forum

Full Version: Webscraping a site that uses javascript?
I'm trying to scrape the stock tickers of the chart embedded on the right of this page (under the graph): http://investsnips.com/list-of-publicly-...companies/
When inspecting the HTML, the stock symbols seem to be embedded here, under the title attribute ("NASDAQ:ADMA"), with the following representing the code for one symbol:
<td class="symbol-short-name-container" title="NASDAQ:ADMA" style="cursor: 
pointer;"><a href="https://www.tradingview.com/chart/?symbol=NASDAQ%3AADMA" 
target="_blank">ADMA Biologics</a></td>
However, I'm failing to capture this code via find_all.
import bs4 as bs
import urllib.request

import re

source = urllib.request.urlopen('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/').read()
soup = bs.BeautifulSoup(source,'lxml')

body = soup.body  # it seems to be under body
After which
body.find_all('tr',  class_="ticker quote-ticker-inited")
[]  # empty list

body.find_all('td',  class_="symbol-short-name-container")
[] #empty list
So it seems that the site uses JavaScript, but I have been scouring a web-scraping book (an old one) and the net, and I can't seem to figure out what I'm supposed to do.
Do I need a different module?


Thank you.
(Apr-29-2017, 08:03 PM)bigmit37 Wrote: [ -> ]So it seems that the site uses JavaScript, but I have been scouring a web-scraping book (an old one) and the net, and I can't seem to figure out what I'm supposed to do.
Do I need a different module?
Yes. If the site uses JavaScript, it changes on the fly, so the data doesn't exist in what plain urllib.request downloads. You need to automate a browser using Selenium. If the HTML your browser gets and the HTML Python gets are different, you need Selenium. You can use PhantomJS to "hide" the browser in the background, so to speak. Since you're just parsing the content, all you really need to do is get the HTML via Selenium instead, and then pass that to BeautifulSoup.
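The handoff described above can be sketched as follows. The Selenium import is done inside the function so the rest of the sketch can run without Selenium installed, and PhantomJS is one driver option among several; the function and helper names here are illustrative, not the poster's actual code:

```python
def fetch_rendered_html(url):
    """Load a JavaScript-heavy page in a real browser and return the
    rendered HTML, ready to be handed to BeautifulSoup."""
    # Imported here so the rest of the sketch runs without Selenium installed.
    from selenium import webdriver

    driver = webdriver.PhantomJS()  # or webdriver.Firefox() for a visible browser
    try:
        driver.get(url)
        return driver.page_source   # HTML *after* the scripts have run
    finally:
        driver.quit()


def extract_symbol_titles(html):
    """Pull the EXCHANGE:TICKER strings out of the rendered table cells,
    e.g. title="NASDAQ:ADMA", using only the stdlib."""
    import re
    return re.findall(r'title="([A-Z]+:[A-Z]+)"', html)


# Usage (requires a browser/driver to be installed):
# html = fetch_rendered_html('http://investsnips.com/list-of-publicly-traded-'
#                            'micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')
# print(extract_symbol_titles(html))
```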
Thanks. I will look into Selenium now and report back here when I get stuck or accomplish my task.

Actually, I noticed something which is confusing me. When I use soup.find_all('p')[-2], the list of tickers seems to be embedded in the returned output.


<p><script src="https://d33t3vvu2t2yu5.cloudfront.net/tv.js" type="text/javascript"></script><br/>
<script type="text/javascript">
new TradingView.MiniWidget({
  "container_id": "tv-miniwidget-c316c",
  "tabs": [
    "Micro Cap Biotech"
  ],
  "symbols": {
    "Micro Cap Biotech": [
      [
        "Abeona Thera",
        "NASDAQ:ABEO|3m"
      ],
      [
        "Actinium Pharma",
        "AMEX:ATNM|3m"
      ],
      [
        "ADMA Biologics",
        "NASDAQ:ADMA|3m"
      ],
      [
        "Adverum Biotech",
        "NASDAQ:ADVM|3m"
      ],
      [
        "Aeglea",
        "NASDAQ:AGLE|3m"
      ],
      [
        "Affimed",
        "NASDAQ:AFMD|3m"
      ],
      [
        "Akari Therapeutics",
        "NASDAQ:AKTX|3m"
      ],
      [
        "Alcobra",
        "NASDAQ:ADHD|3m"
      ],
      [
        "Actinium Pharma",
        "AMEX:ATNM|3m"
      ],
      [
        "ADMA Biologics",
        "NASDAQ:ADMA|3m"
      ],
      [
        "Adverum Biotech",
        "NASDAQ:ADVM|3m"
      ],
      [
        "Aeglea",
        "NASDAQ:AGLE|3m"
      ],
      [
        "Affimed",
        "NASDAQ:AFMD|3m"
      ],
      [
        "Akari Therapeutics",
        "NASDAQ:AKTX|3m"
      ],
      [
        "Alcobra",
        "NASDAQ:ADHD|3m"
Does this mean it's still scrapable with BeautifulSoup?

Thank you.
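Since the MiniWidget configuration above is present in the static HTML, the tickers can indeed be pulled out without Selenium. A sketch using only the stdlib, with the regex pattern assumed from the EXCHANGE:TICKER|3m format shown above:

```python
import re

# A small excerpt of the inline script, as shown above.
script = '''
new TradingView.MiniWidget({
  "symbols": {
    "Micro Cap Biotech": [
      ["Abeona Thera", "NASDAQ:ABEO|3m"],
      ["Actinium Pharma", "AMEX:ATNM|3m"],
      ["ADMA Biologics", "NASDAQ:ADMA|3m"]
    ]
  }
})
'''

# EXCHANGE:TICKER appears just before the "|3m" interval suffix.
tickers = re.findall(r'"([A-Z]+:[A-Z]+)\|', script)
print(tickers)  # ['NASDAQ:ABEO', 'AMEX:ATNM', 'NASDAQ:ADMA']
```

On the real page the same pattern can be applied to the text of the script tag found via soup.find_all('p')[-2], so BeautifulSoup plus re would be enough here.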
Also look at the tutorials by snippsat: https://python-forum.io/Thread-Web-Scraping-part-1
and https://python-forum.io/Thread-Web-scraping-part-2 (this one has examples with selenium)
(Apr-29-2017, 08:52 PM)Larz60+ Wrote: [ -> ]Also look at the tutorials by snippsat: https://python-forum.io/Thread-Web-Scraping-part-1
and https://python-forum.io/Thread-Web-scraping-part-2 (this one has examples with selenium)

Nice, I wasn't aware he had a Selenium explanation in his tuts.
Okay, I'm using Selenium but I seem to be stuck.



I can't seem to switch to the frame I want.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('http://investsnips.com/list-of-publicly-traded-micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')
#driver.find_element_by_xpath('//*[@id="tradingview_4e896"]')
driver.switch_to.frame("tradingview_4e896")
I've tried both the commented line and switch_to.frame, and I'm ending up with a NoSuchElement/NoSuchFrame error:

NoSuchElementException: Message: Unable to locate element: //*[@id="tradingview_4e896"]


This is the frame I'm trying to connect to :


Not sure what I seem to be doing wrong.
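One way to debug this is to list every iframe id the driver can actually see before trying to switch. A sketch (the helper name is mine; find_elements_by_tag_name follows the Selenium API used elsewhere in the thread):

```python
def list_frame_ids(driver):
    """Return the id attribute of every <iframe> on the current page,
    so the right frame name can be spotted before switching."""
    return [f.get_attribute('id')
            for f in driver.find_elements_by_tag_name('iframe')]

# Usage, after driver.get(...) has loaded the page:
# print(list_frame_ids(driver))   # e.g. ['tradingview_4e896', ...]
```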
I can't see which id you are trying to switch to.
Try:
time.sleep(4) # let site load 
driver.switch_to.frame(tradingview_03881)
# Or
driver.switch_to.frame(tradingview_0e8ff)
(Apr-30-2017, 07:37 PM)snippsat Wrote: [ -> ]I can't id you are trying to switch to.
Try:
time.sleep(4) # let site load 
driver.switch_to.frame(tradingview_03881)
# Or
driver.switch_to.frame(tradingview_0e8ff)




Can't seem to connect to those either.

NoSuchFrameException: Message: tradingview_0e8ff

I even have time.sleep(20). 


driver.switch_to.frame('tradingview_03881')  # I added quotes around them in my code, as they are strings.


It seems the ID changes each time the page loads. I wonder if I can use re to deal with that.
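A regex can indeed recover the changing id from the page source, since only the suffix varies. A minimal sketch, assuming the id always has the form tradingview_<hex> as seen in the thread (the function name is mine):

```python
import re

def find_tradingview_frame_id(html):
    """Return the first frame id of the form tradingview_<hex>,
    or None if no such id appears in the page source."""
    match = re.search(r'id="(tradingview_[0-9a-f]+)"', html)
    return match.group(1) if match else None

# A fragment like the one the browser shows:
sample = '<iframe id="tradingview_4e896" src="..."></iframe>'
print(find_tradingview_frame_id(sample))  # tradingview_4e896

# With Selenium, the returned name could then be used directly:
# driver.switch_to.frame(find_tradingview_frame_id(driver.page_source))
```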
I see that the id name changes on reload.
tradingview is the same every time.
You can try:
driver.switch_to_frame(driver.find_element_by_partial_link_text("tradingview"))
(Apr-30-2017, 08:53 PM)snippsat Wrote: [ -> ]I see that the id name changes on reload.
tradingview is the same every time.
You can try:
driver.switch_to_frame(driver.find_element_by_partial_link_text("tradingview"))



Okay, I actually got to the next step using the XPath, which I copied via Firebug, and I was able to find all the elements I was interested in.


driver.find_element_by_xpath('/html/body/div[1]/div/div/article/div/div[2]/div/div[2]/div[1]/div/div/div[1]/iframe')



Thank you so much for your help.
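Putting the working pieces together, the whole flow could look like the sketch below. The XPath is the one quoted above; everything else (function name, parsing step) is illustrative rather than the poster's exact code:

```python
def scrape_tickers(driver):
    """Switch into the TradingView iframe found by XPath, then pull the
    EXCHANGE:TICKER titles out of the rendered table cells."""
    import re

    frame = driver.find_element_by_xpath(
        '/html/body/div[1]/div/div/article/div/div[2]/div'
        '/div[2]/div[1]/div/div/div[1]/iframe')
    driver.switch_to.frame(frame)

    # The title attributes carry the symbols, e.g. title="NASDAQ:ADMA".
    return re.findall(r'title="([A-Z]+:[A-Z]+)"', driver.page_source)

# Usage (requires Firefox and a driver to be installed):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get('http://investsnips.com/list-of-publicly-traded-'
#            'micro-cap-diversified-biotechnology-and-pharmaceutical-companies/')
# print(scrape_tickers(driver))
```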