getting financial data from yahoo finance
#1
Hello, I have a problem scraping data from Yahoo Finance. I searched the forum, but everything I found is about stock price data, not financial data. I want to get the Income Statement, Balance Sheet and Cash Flow for valuation.

Here is the code (credit to Matt Button):
from datetime import datetime
import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd

symbol = 'INDF.JK'

url = 'https://finance.yahoo.com/quote/' + symbol + '/balance-sheet?p=' + symbol

# Set up the request headers.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Pragma': 'no-cache',
    'Referrer': 'https://google.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}

# Fetch the page (headers must be passed as a keyword argument,
# otherwise requests treats them as query parameters).
page = requests.get(url, headers=headers)

# Parse the page with LXML.
tree = html.fromstring(page.content)


table_rows = tree.xpath("//div[contains(@class, 'D(tbr)')]")


# Ensure that some table rows are found.
assert len(table_rows) > 0

parsed_rows = []

for table_row in table_rows:
    parsed_row = []
    # ~ print(table_row)
    el = table_row.xpath("./div")
    none_count = 0
    
    for rs in el:
        try:
            (text,) = rs.xpath('.//span/text()[1]')
            parsed_row.append(text)
        except ValueError:
            parsed_row.append(np.NaN)
            none_count += 1

    if (none_count < 4):
        parsed_rows.append(parsed_row)

df = pd.DataFrame(parsed_rows)
print(df)
It gives this output:
Output:
0   Breakdown                                 12/31/2019      12/31/2018      12/31/2017      12/31/2016
1   Total Assets                              96,198,559,000  96,537,796,000  87,939,488,000  82,174,515,000
2   Total Liabilities Net Minority Interest   41,996,071,000  46,620,996,000  41,182,764,000  38,233,092,000
3   Total Equity Gross Minority Interest      54,202,488,000  49,916,800,000  46,756,724,000  43,941,423,000
4   Total Capitalization                      46,732,924,000  41,103,855,000  42,785,937,000  40,862,141,000
5   Common Stock Equity                       37,777,948,000  33,614,280,000  31,178,844,000  28,974,286,000
6   Net Tangible Assets                       31,461,529,000  27,157,067,000  25,379,979,000  22,667,765,000
7   Working Capital                           6,716,583,000   2,068,516,000   10,877,636,000  9,766,002,000
8   Invested Capital                          60,755,105,000  63,341,015,000  55,496,540,000  51,385,909,000
9   Tangible Book Value                       31,461,529,000  27,157,067,000  25,379,979,000  22,667,765,000
10  Total Debt                                22,977,157,000  29,726,735,000  24,317,696,000  22,411,623,000
11  Net Debt                                  9,232,039,000   20,917,482,000  10,627,698,000  9,049,387,000
12  Share Issued                              8,780,427       8,780,427       8,780,427       8,780,427
13  Ordinary Shares Number                    8,780,427       8,780,427       8,780,427       8,780,427
It did not get the complete data, such as cash and cash equivalents, inventory, and so on.
When I download the web page first and then parse the saved file, it gives the complete data:

from datetime import datetime
import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd
import os

os.chdir(r'D:\ahmad\python\web')
with open('INDF.JK 6,425.00 -325.00 -4.81% Indofood Sukses Makmur Tbk. - Yahoo Finance.html') as a:
	page = a.read()

tree = html.fromstring(page)

table_rows = tree.xpath("//div[contains(@class, 'D(tbr)')]")

assert len(table_rows) > 0

parsed_rows = []

for table_row in table_rows:
    parsed_row = []
    # ~ print(table_row)
    el = table_row.xpath("./div")
    none_count = 0
    
    for rs in el:
        try:
            (text,) = rs.xpath('.//span/text()[1]')
            parsed_row.append(text)
        except ValueError:
            parsed_row.append(np.NaN)
            none_count += 1

    if (none_count < 4):
        parsed_rows.append(parsed_row)

df = pd.DataFrame(parsed_rows)
print(df)
The output:
Output:
0   Breakdown                                          12/30/2019      12/30/2018      12/30/2017      12/30/2016
1   Total Assets                                       96,198,559,000  96,537,796,000  87,939,488,000  82,174,515,000
2   Current Assets                                     31,403,445,000  33,272,618,000  32,515,399,000  28,985,443,000
3   Cash, Cash Equivalents & Short Term Investments    13,800,610,000  12,928,189,000  14,490,157,000  13,896,374,000
4   Cash And Cash Equivalents                          13,745,118,000  8,809,253,000   13,689,998,000  13,362,236,000
5   Cash                                               4,714,869,000   4,489,205,000   3,564,920,000   4,251,630,000
6   Cash Equivalents                                   9,030,249,000   4,320,048,000   10,125,078,000  9,110,606,000
7   Other Short Term Investments                       55,492,000      4,118,936,000   800,159,000     534,138,000
8   Inventory                                          9,658,705,000   11,644,156,000  9,690,981,000   8,469,821,000
9   Prepaid Assets                                     1,262,100,000   1,610,941,000   1,275,500,000   1,233,831,000
10  Assets Held for Sale Current                       NaN             NaN             NaN             0
11  Other Current Assets                               717,620,000     516,656,000     205,876,000     180,900,000
12  Total non-current assets                           64,795,114,000  63,265,178,000  55,424,089,000  53,189,072,000
13  Total Liabilities Net Minority Interest            41,996,071,000  46,620,996,000  41,182,764,000  38,233,092,000
14  Total Equity Gross Minority Interest               54,202,488,000  49,916,800,000  46,756,724,000  43,941,423,000
15  Total Capitalization                               46,732,924,000  41,103,855,000  42,785,937,000  40,862,141,000
16  Common Stock Equity                                37,777,948,000  33,614,280,000  31,178,844,000  28,974,286,000
17  Net Tangible Assets                                31,461,529,000  27,157,067,000  25,379,979,000  22,667,765,000
18  Working Capital                                    6,716,583,000   2,068,516,000   10,877,636,000  9,766,002,000
19  Invested Capital                                   60,755,105,000  63,341,015,000  55,496,540,000  51,385,909,000
20  Tangible Book Value                                31,461,529,000  27,157,067,000  25,379,979,000  22,667,765,000
21  Total Debt                                         22,977,157,000  29,726,735,000  24,317,696,000  22,411,623,000
22  Net Debt                                           9,232,039,000   20,917,482,000  10,627,698,000  9,049,387,000
23  Share Issued                                       8,780,427       8,780,427       8,780,427       8,780,427
24  Ordinary Shares Number                             8,780,427       8,780,427       8,780,427       8,780,427
How can I get the complete data without downloading the whole page first?
By the way, even the saved page still doesn't give the detail rows below Total non-current assets, such as land, buildings, machinery, etc.
How do I get those?
#2
Now I have found the problem, but I haven't found the solution yet. The page has expand buttons. When a button is not expanded, the row is:
div class="" data-test="fin-row" data-reactid="66"

If the button is expanded, it becomes:
div class="rw-expnded" data-test="fin-row" data-reactid="66"

I tried to change the class value from "" to "rw-expnded", but to no avail.

I can find the element with XPath, but how do I change the class value? Can you give me a pointer?

My code:
import lxml
from lxml import html
from lxml.html import fromstring, tostring

#opening html file
filename = 'INDF.JK 6,425.00 -325.00 -4.81% Indofood Sukses Makmur Tbk. - Yahoo Finance.html'
with open(filename) as a:
	page = a.read()

#parsing the file
tree = html.fromstring(page)

#search div element, data-set attribute with fin-row value
buttons = tree.xpath("//div[contains(@data-test, 'fin-row')]")
What should I do to change the class value (class="" into class="rw-expnded")?
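For reference, changing an attribute with lxml is just Element.set(); a minimal sketch (reusing the filename from the code above) would look like this. Note that this only edits the parsed tree in memory, so it cannot create detail rows that were never saved in the HTML:
from lxml import html

# Sketch: flip every collapsed fin-row's class to "rw-expnded" in the parsed tree.
# This only changes the attribute in memory -- it cannot create detail rows that
# were never present in the saved HTML, so the hidden sub-items still won't appear.
with open(filename) as a:              # same local file as in the code above
    tree = html.fromstring(a.read())

for row in tree.xpath("//div[@data-test='fin-row']"):
    if not row.get('class'):           # class="" means the row is collapsed
        row.set('class', 'rw-expnded')

print(html.tostring(tree.xpath("//div[@data-test='fin-row']")[0])[:300])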
#3
(May-26-2020, 02:58 PM)asiaphone12 Wrote: I can find the element with XPath, but how do I change the class value? Can you give me a pointer?
You will need Selenium for this.
There are also libraries like yfinance that still work after Yahoo's API change.
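A minimal sketch of the yfinance route (whether the frames come back populated depends on the ticker and exchange):
# Minimal yfinance sketch (pip install yfinance).
# Whether these frames are populated depends on the ticker/exchange.
import yfinance as yf

ticker = yf.Ticker('INDF.JK')
print(ticker.balance_sheet)   # annual balance sheet as a DataFrame
print(ticker.financials)      # annual income statement
print(ticker.cashflow)        # annual cash flow statement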
#4
(May-26-2020, 04:15 PM)snippsat Wrote: You will need Selenium for this.
There are also libraries like yfinance that still work after Yahoo's API change.

I have tried yfinance; when I use it to retrieve financial data from the Indonesia Stock Exchange, it returns empty data. That's why I want to create my own script.

I have tried lxml and bs4; I'll try Selenium next.

It's kind of fun to develop my own program.

I learn new things every day. At least I have something to do while stuck at home.
#5
I found out how to click the toggle buttons with Selenium, but I can only click the first level of buttons. The rows have buttons nested inside buttons, up to 4 levels deep.

My code:
from selenium import webdriver
from selenium.webdriver.common.by import By

#file path access local html
filename1 = r'D:\ahmad\python\web\INDF.JK 6,425.00 -325.00 -4.81% Indofood Sukses Makmur Tbk. - Yahoo Finance.html' 

#opening file in firefox browser
driver = webdriver.Firefox()
driver.get("file:\\" + filename1)

#accessing toggle button level
level1 = driver.find_elements(By.XPATH, "//button[contains(@class, 'tgglBtn')]")
level2 = driver.find_elements(By.XPATH, "//button[contains(@class, 'tgglBtn')]//button[contains(@class, 'tgglBtn')]")
level3 = driver.find_elements(By.XPATH, "//button[contains(@class, 'tgglBtn')]//button[contains(@class, 'tgglBtn')]//button[contains(@class, 'tgglBtn')]")
level4 = driver.find_elements(By.XPATH, "//button[contains(@class, 'tgglBtn')]//button[contains(@class, 'tgglBtn')]//button[contains(@class, 'tgglBtn')]//button[contains(@class, 'tgglBtn')]")

#clicking all the button
for elemlevel1 in level1:
	elemlevel1.click()

for elemlevel2 in level2:
	elemlevel2.click()

for elemlevel3 in level3:
	elemlevel3.click()

for elemlevel4 in level4:
	elemlevel4.click()
It can click the level-1 buttons, but the buttons at the deeper levels don't get clicked. How do I click those buttons?
[Image: Screenshot-2020-05-27-Indofood-Sukses-Ma...inance.png]
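For anyone trying the nested-click route: a <button> can never contain another <button> in HTML, so the level2 to level4 XPaths above match nothing, and the deeper rows only exist once their parent row has been expanded. A minimal sketch that instead re-queries the DOM after each pass, using the class="" / "rw-expnded" toggle from post #2 as the "still collapsed" test (that locator is an assumption about the markup), would be the following. As the next reply points out, the page's own Expand All button makes this unnecessary.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
from time import sleep

filename1 = r'D:\ahmad\python\web\INDF.JK 6,425.00 -325.00 -4.81% Indofood Sukses Makmur Tbk. - Yahoo Finance.html'

driver = webdriver.Firefox()
driver.get("file:\\" + filename1)

# Buttons that sit inside rows that are still collapsed (class="" rather than
# "rw-expnded"); the XPath is an assumption based on the markup shown in post #2.
COLLAPSED = "//div[@data-test='fin-row' and @class='']//button[contains(@class, 'tgglBtn')]"

for _ in range(6):                 # rows nest a few levels deep, so a few passes suffice
    buttons = driver.find_elements(By.XPATH, COLLAPSED)
    if not buttons:
        break                      # nothing left to expand
    for btn in buttons:
        try:
            btn.click()
        except StaleElementReferenceException:
            pass                   # row was re-rendered; it gets picked up on the next pass
    sleep(1)                       # give the page time to render the newly revealed rows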
#6
(May-27-2020, 10:32 AM)asiaphone12 Wrote: It can click the level-1 buttons, but the buttons at the deeper levels don't get clicked. How do I click those buttons?
There is an Expand All button/link; this opens everything.

Hmm, are you able to run this from local HTML? There is a lot going on that a local HTML file may not have access to, e.g. JavaScript/Ajax/JSON.
Here is a test; I first have to click an accept button to get in, then Expand All.
Then you can try to get a value that is shown in the expanded layout.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument('--disable-gpu')
#options.add_argument('--log-level=3')
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
browser.get('https://finance.yahoo.com/quote/INDF.JK/balance-sheet?p=INDF.JK')
time.sleep(3)
accept_button = browser.find_elements_by_css_selector('#consent-page > div > div > div > div.wizard-footer > div > form > button.btn.primary')
accept_button[0].click()
time.sleep(3)
expand = browser.find_elements_by_xpath('//*[@id="Col1-1-Financials-Proxy"]/section/div[2]/button')
expand[0].click()

# Example send source to BS for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
price = soup.select_one('#Col1-1-Financials-Proxy > section > div.Pos\(r\) > div.W\(100\%\).Whs\(nw\).Ovx\(a\).BdT.Bdtc\(\$seperatorColor\) > div.M\(0\).Whs\(n\).BdEnd.Bdc\(\$seperatorColor\).D\(itb\) > div.D\(tbrg\) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div.D\(tbr\).fi-row.Bgc\(\$hoverBgColor\)\:h > div:nth-child(2) > span')
print(price.text)
Output:
4,714,869,000
#7
(May-27-2020, 03:20 PM)snippsat Wrote: There is an Expand All button/link; this opens everything.

OH MY GOD!!! I DIDN'T SEE THAT!!!

Looks like I missed it.

I can run it from a file because I downloaded the page in complete format, so it works offline. I use a mobile hotspot to access the internet, so accessing the page online is difficult: the waiting time is long, which is why I use a local HTML file.

Thanks for the pointer, my script is now complete.

My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from lxml import html
import lxml
import numpy as np
import pandas as pd

#file path access local html
filename1 = r'D:\ahmad\python\web\INDF.JK.html' 

#opening file in firefox browser
driver = webdriver.Firefox()
driver.get("file:\\" + filename1)
sleep(5)

#clicking "Expand All"
btnclick = driver.find_elements(By.XPATH, "//*[@id='Col1-1-Financials-Proxy']/section/div[2]/button")
btnclick[0].click()

#parsing into lxml
tree = html.fromstring(driver.page_source)

#searching table financial data
table_rows = tree.xpath("//div[contains(@class, 'D(tbr)')]")

# Ensure that some table rows are found
assert len(table_rows) > 0

parsed_rows = []

for table_row in table_rows:
    parsed_row = []
    el = table_row.xpath("./div")
    
    none_count = 0
    
    for rs in el:
        try:
            (text,) = rs.xpath('.//span/text()[1]')
            parsed_row.append(text)
        except ValueError:
            parsed_row.append(np.NaN)
            none_count += 1

    if (none_count < 4):
        parsed_rows.append(parsed_row)

df = pd.DataFrame(parsed_rows)
print(df)
The result is:
Output:
0   Breakdown                                          12/30/2019      12/30/2018      12/30/2017      12/30/2016
1   Total Assets                                       96,198,559,000  96,537,796,000  87,939,488,000  82,174,515,000
2   Current Assets                                     31,403,445,000  33,272,618,000  32,515,399,000  28,985,443,000
3   Cash, Cash Equivalents & Short Term Investments    13,800,610,000  12,928,189,000  14,490,157,000  13,896,374,000
4   Cash And Cash Equivalents                          13,745,118,000  8,809,253,000   13,689,998,000  13,362,236,000
..  ...                                                ...             ...             ...             ...
61  Tangible Book Value                                31,461,529,000  27,157,067,000  25,379,979,000  22,667,765,000
62  Total Debt                                         22,977,157,000  29,726,735,000  24,317,696,000  22,411,623,000
63  Net Debt                                           9,232,039,000   20,917,482,000  10,627,698,000  9,049,387,000
64  Share Issued                                       8,780,427       8,780,427       8,780,427       8,780,427
65  Ordinary Shares Number                             8,780,427       8,780,427       8,780,427       8,780,427

[66 rows x 5 columns]
Thanks for the guidance!
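As a small follow-up, the raw DataFrame still stores the header row and the line-item names as data; a quick tidy-up sketch (column names taken from the first parsed row, Breakdown used as the index, commas stripped so the values become numbers; the CSV file name is just an example):
# Tidy-up sketch for the DataFrame produced by the script above (reuses df and pd).
df.columns = df.iloc[0]                          # first parsed row holds the dates
df = df.drop(0).set_index('Breakdown')           # line-item names become the index
df = df.apply(lambda col: pd.to_numeric(col.str.replace(',', ''), errors='coerce'))

print(df.loc['Total Assets'])                    # one line item across all years
df.to_csv('INDF.JK_balance_sheet.csv')           # example output file name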
#8
Thanks for your nice solution.
Do you know a way to get it with both the expanded and the quarterly view at the same time?
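A minimal sketch of one way to do that with the same Selenium approach: click the quarterly toggle first, then Expand All, then parse the page source as before. Locating the quarterly button by its visible text is an assumption about the page markup, not verified against the live page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

driver = webdriver.Firefox()
driver.get('https://finance.yahoo.com/quote/INDF.JK/balance-sheet?p=INDF.JK')
sleep(5)

# Switch to the quarterly view first; locating the toggle by its visible text
# is an assumption about the markup -- adjust the XPath if the page differs.
quarterly = driver.find_elements(By.XPATH, "//button[.//span[text()='Quarterly']]")
if quarterly:
    quarterly[0].click()
    sleep(3)

# Then Expand All, same locator as in the script above.
expand = driver.find_elements(By.XPATH, "//*[@id='Col1-1-Financials-Proxy']/section/div[2]/button")
if expand:
    expand[0].click()
sleep(2)

# driver.page_source can now be fed to the same lxml/pandas parsing code as before.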