Python Forum

Full Version: parsing table
(Apr-27-2018, 12:16 PM)ian Wrote: When I use 'Inspect element' of IE11, I can see all tags in that table.
Sure, there are tags when you look in the browser.
Remember that what you see in the browser (Inspect element) is the rendered version of the site, after JavaScript has run.
The whole table is generated by JavaScript in the browser's DOM.
So if you turn off JavaScript in the browser, you will not see any table.

Tools like Requests, BeautifulSoup, and lxml cannot execute JavaScript or build the DOM as a browser does.
So the table will not be in what they return.
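
To make the point concrete, here is a small sketch. It uses the standard library's HTMLParser as a stand-in for BeautifulSoup so it runs without extra packages, and the HTML string is a made-up example, not the real page source. The script that would fill the table is only seen as text, so no rows are found.

```python
from html.parser import HTMLParser

# Hypothetical page source: the table body is empty until the script runs in a browser.
STATIC_HTML = """
<html><body>
<table><tbody id="leaders"></tbody></table>
<script>
  // A browser would execute this and add a row; a parser only sees it as text.
  document.getElementById("leaders").innerHTML = "<tr><td>RY-T</td></tr>";
</script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Counts <tr> start tags that exist in the markup itself."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


counter = RowCounter()
counter.feed(STATIC_HTML)
print(counter.rows)  # 0, because the script was never executed
```

A real browser (or Selenium driving one) would run the script first, and only then would the row exist in the DOM.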

One solution is Selenium, which can do full browser automation.
The other, as mentioned by @nilamo, is to look at the source and try to find the request that returns JSON.
The site only has a news API, so you have to figure out the call yourself.

I had a look at this and can give some examples.
import requests

headers = {
    'pragma': 'no-cache',
    'origin': 'https://www.theglobeandmail.com',
    'accept-encoding': 'gzip, deflate',  # 'br' left out: Requests can only decode Brotli if a brotli package is installed
    'accept-language': 'nb-NO,nb;q=0.9,no;q=0.8,nn;q=0.7,en-US;q=0.6,en;q=0.5',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36',
    'content-type': 'application/x-www-form-urlencoded',
    'accept': '*/*',
    'cache-control': 'no-cache',
    'authority': 'globeandmail.pl.barchart.com',
    'referer': 'https://www.theglobeandmail.com/investing/markets/stocks/market-leaders/',
}

data = [
  ('fields', 'symbol,symbolName,lastPrice,priceChange,percentChange,priceVolume,tradeTime'),
  ('lists', 'stocks.volumeLeaders.price-volume.tsx'),
]

response = requests.post('https://globeandmail.pl.barchart.com/module/dataTable.json', headers=headers, data=data)
response.raise_for_status()  # stop on an HTTP error instead of trying to parse an error page
json_data = response.json()
Now you can test the JSON that comes back:
>>> json_data['data'][0]
{'lastPrice': '97.18',
 'percentChange': '+0.59%',
 'priceChange': '+0.57',
 'priceVolume': '340,592',
 'raw': {'lastPrice': 97.18,
         'percentChange': 0.0059,
         'priceChange': 0.57,
         'priceVolume': 340592,
         'symbol': 'RY.TO',
         'symbolName': 'Royal Bank of Canada',
         'symbolType': 6,
         'tradeTime': 1524778800},
 'symbol': 'RY-T',
 'symbolName': 'Royal Bank of Canada',
 'symbolType': 6,
 'tradeTime': '04/26/18'}
>>> json_data['data'][1]
{'lastPrice': '49.73',
 'percentChange': '+1.08%',
 'priceChange': '+0.53',
 'priceVolume': '215,442',
 'raw': {'lastPrice': 49.73,
         'percentChange': 0.0108,
         'priceChange': 0.53,
         'priceVolume': 215442,
         'symbol': 'SU.TO',
         'symbolName': 'Suncor Energy Inc',
         'symbolType': 6,
         'tradeTime': 1524778800},
 'symbol': 'SU-T',
 'symbolName': 'Suncor Energy Inc',
 'symbolType': 6,
 'tradeTime': '04/26/18'}
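
From here the values can be pulled into plain rows. This is only a sketch: the sample dict is trimmed from the output above, to_rows is a hypothetical helper name, and it assumes the raw tradeTime is a Unix timestamp in seconds (which matches the '04/26/18' shown).

```python
from datetime import datetime, timezone

# Sample trimmed from the output above; a live response has more rows and fields.
sample = {
    "data": [
        {"symbol": "RY-T",
         "raw": {"symbolName": "Royal Bank of Canada", "lastPrice": 97.18,
                 "priceVolume": 340592, "tradeTime": 1524778800}},
        {"symbol": "SU-T",
         "raw": {"symbolName": "Suncor Energy Inc", "lastPrice": 49.73,
                 "priceVolume": 215442, "tradeTime": 1524778800}},
    ]
}


def to_rows(payload):
    """Return (symbol, name, last price, volume, trade date) tuples.

    Uses the numeric values under 'raw' so no string parsing is needed.
    """
    rows = []
    for item in payload["data"]:
        raw = item["raw"]
        # Assumption: tradeTime is seconds since the Unix epoch; date is taken in UTC.
        trade_date = datetime.fromtimestamp(raw["tradeTime"], tz=timezone.utc).date()
        rows.append((item["symbol"], raw["symbolName"], raw["lastPrice"],
                     raw["priceVolume"], trade_date.isoformat()))
    return rows


for row in to_rows(sample):
    print(row)
```

With a live response you would pass json_data instead of sample.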

For Selenium, look at Web-scraping part-2.
This is a headless setup, which means the browser window is not shown.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
# Selenium 4 style; Selenium 3 took chrome_options= and executable_path= here instead
browser = webdriver.Chrome(service=Service(r'chromedriver.exe'), options=chrome_options)
#--| Parse
url = 'https://www.theglobeandmail.com/investing/markets/stocks/market-leaders/'
browser.get(url)
time.sleep(3)  # crude wait so JavaScript can build the table; 3 seconds is a guess
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()
tbody = soup.find('tbody')
first_row = tbody.find('tr')
# find_all returns a list of matches, so index 0 is the first lastPrice field in the row
first_value = first_row.find_all('barchart-field', attrs={"name": "lastPrice"})
print(first_value[0].text)
Output:
97.18