Python Forum
I tried every way to scrape Morningstar financials data without success so far
#1
I have tried every approach I could think of or find in an attempt to scrape Morningstar financials data into a processable form such as CSV or a DataFrame, for instance from here:
https://financials.morningstar.com/ratios/r.html?t=AAPL

There are several possible ways to do this. One is to find the exact URL of the CSV file. A number of online resources suggest that the link should be
http://financials.morningstar.com/ajax/e...&order=asc
But it no longer works.

Another way is to automate clicking the "Export" button after the program opens the page through a webdriver. Many resources point to this or similar solutions:

from selenium import webdriver

d = webdriver.Chrome()
d.get('http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US')
d.find_element_by_css_selector('.large_button').click()
d.quit()
This runs without any error or exception, but no file is downloaded afterwards. The other CSS selector values I found suggested don't work either; I tested every one I saw.

A third way is to scrape the data from another primary source:
http://financials.morningstar.com/finan/...xxx&t=AAPL

from bs4 import BeautifulSoup
import requests
import re
import json

url1 = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=xxx&t=AAPL'
url2 = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?&callback=xxx&t=AAPL'

soup1 = BeautifulSoup(json.loads(re.findall(r'xxx\((.*)\)', requests.get(url1).text)[0])['componentData'], 'lxml')
soup2 = BeautifulSoup(json.loads(re.findall(r'xxx\((.*)\)', requests.get(url2).text)[0])['componentData'], 'lxml')

def print_table(soup):
    for i, tr in enumerate(soup.select('tr')):
        row_data = [td.text for td in tr.select('td, th') if td.text]
        if not row_data:
            continue
        if len(row_data) < 12:
            row_data = ['X'] + row_data
        for j, td in enumerate(row_data):
            if j==0:
                print('{: >30}'.format(td), end='|')
            else:
                print('{: ^12}'.format(td), end='|')
        print()

print_table(soup1)
print()
print_table(soup2)
Credit here: https://stackoverflow.com/questions/5669...orningstar

It works! The table is beautifully printed and it contains the information I want. The only problem is that I have no idea how to convert it into a processable format like CSV or a DataFrame.
How can I do that? Any help would be much appreciated.
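
For anyone reading the third approach: the endpoint returns JSONP, that is, JSON wrapped in a call to the callback named in the URL (xxx here), and the HTML table sits in the 'componentData' field. Below is a minimal sketch of that unwrapping step on its own. The sample response string and the helper name unwrap_jsonp are made up for illustration, and the real response may be shaped differently.

```python
import json
import re

# Hypothetical response body, shaped like the JSONP the code above expects.
SAMPLE_RESPONSE = 'xxx({"componentData": "<table><tr><td>1</td></tr></table>"})'


def unwrap_jsonp(text, callback="xxx"):
    """Strip the callback(...) wrapper and parse the JSON inside it."""
    pattern = r"^\s*" + re.escape(callback) + r"\((.*)\)\s*;?\s*$"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in the expected callback")
    return json.loads(match.group(1))


payload = unwrap_jsonp(SAMPLE_RESPONSE)
html = payload["componentData"]  # the HTML table as a plain string
```

Raising an error when the wrapper is missing makes a changed or blocked endpoint easier to spot than an IndexError from re.findall(...)[0].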
#2
(Oct-19-2020, 05:25 PM)sparkt Wrote: The table is beautifully printed and it contains the information I want. The only problem is that I have no idea how to change it into processable format like csv or dataframe.
Since the HTML contains a table, you can use pd.read_html to get that table.
Here is a demo Notebook.
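
In case the notebook is unavailable, here is a rough sketch of the pd.read_html idea. It is not the notebook's code. The sample HTML, its labels and the helper name html_to_frame are invented for illustration, and it assumes pandas plus an HTML parser such as lxml are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the HTML string found in 'componentData'.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Item</th><th>2018-09</th><th>2019-09</th></tr>
  </thead>
  <tbody>
    <tr><th>Metric A</th><td>1.5</td><td>2.5</td></tr>
    <tr><th>Metric B</th><td>10</td><td>20</td></tr>
  </tbody>
</table>
"""


def html_to_frame(html):
    """Return the first <table> in the HTML as a DataFrame."""
    # read_html returns a list with one DataFrame per table it finds.
    tables = pd.read_html(StringIO(html), index_col=0)
    return tables[0]


df = html_to_frame(SAMPLE_HTML)
csv_text = df.to_csv()  # pass a file path instead to write a CSV file
```

Wrapping the string in StringIO avoids the deprecation warning that newer pandas versions emit when read_html is given literal HTML.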
#3
It's good to learn about that, thanks!
I edited the code that uses BeautifulSoup and am able to get a DataFrame now. Not as difficult as I thought!
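
The edited code was not posted, so the following is only a guess at what a BeautifulSoup route to a DataFrame might look like, not the actual solution. The sample HTML and the helper name table_to_frame are hypothetical, and it assumes the first row holds the column headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the table HTML parsed in the first post.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>2018-09</th><th>2019-09</th></tr>
  <tr><th>Metric A</th><td>1.5</td><td>2.5</td></tr>
  <tr><th>Metric B</th><td>10</td><td>20</td></tr>
</table>
"""


def table_to_frame(soup):
    """Collect the cell text of every row, then build a DataFrame."""
    rows = []
    for tr in soup.select("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.select("td, th")]
        if any(cells):  # skip rows that contain no text at all
            rows.append(cells)
    header, *body = rows  # assumption: the first row is the header
    frame = pd.DataFrame(body, columns=header).set_index(header[0])
    # Convert the text cells to numbers; unparseable cells become NaN.
    return frame.apply(pd.to_numeric, errors="coerce")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = table_to_frame(soup)
```

Unlike print_table in the first post, this keeps empty cells in place so the columns stay aligned without padding rows with 'X'.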