Python Forum
webscraping yahoo data - custom date implementation
#1
Hi,

I'm trying to web-scrape historical prices from Yahoo Finance.
I managed to get the data, but only for the most recent months (about 4-5 months).
I can't figure out how to access the time period so that I can add a start and an end date.

Any help would be really appreciated!

Below is an example for Apple, where you can see the time period I'm trying to access:
https://finance.yahoo.com/quote/AAPL/history?p=AAPL

import bs4 as bs
import urllib.request
import pandas as pd

def get_ticker(ticker):
    
    url = 'https://finance.yahoo.com/quote/' + ticker + '/history?p=' + ticker
    source = urllib.request.urlopen(url).read()      
    soup = bs.BeautifulSoup(source, 'lxml')
    tr = soup.find_all('tr')
    
    data = []
    
    for table in tr:
        td = table.find_all('td')
        row = [i.text for i in td]
        data.append(row)        
    
    data = data[1:-2]
    df = pd.DataFrame(data)
    df.columns = columns
    df.set_index(columns[0], inplace=True)
    # convert_objects() is deprecated (removed in newer pandas); errors='ignore'
    # leaves non-numeric columns such as the comma-formatted Volume unchanged
    df = df.apply(pd.to_numeric, errors='ignore')
    df = df.iloc[::-1]
    df.dropna(inplace=True)
    
    return df

(Jun-17-2018, 05:24 PM)Jens89 Wrote: [...]

I forgot to define `columns` in the code above. The missing line, which belongs just before `data = data[1:-2]`, is:

columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
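As a side note, the column names can also be passed straight to the DataFrame constructor instead of assigning `df.columns` afterwards; a minimal sketch with a single made-up row of AAPL data:

```python
import pandas as pd

columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
rows = [['Apr 03, 2018', '167.64', '168.75', '164.88', '168.39', '167.74', '30,278,000']]

# Passing columns= at construction avoids a separate df.columns assignment
df = pd.DataFrame(rows, columns=columns).set_index('Date')
print(df.shape)  # (1, 6)
```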
#2
Please don't add bold or color inside the Python tags. It mucks everything up when others try to run the code.

I was able to get data for specific dates using the following code. With my code there seems to be a limit of somewhat less than 90 days of data per request. I also had to adjust the start date by one day to get the same printout as running the web site manually. The following should help you get started.
import bs4 as bs
import urllib.request
import pandas as pd
import time
 
def get_ticker(ticker, day_one, day_two):
     
    url = 'https://finance.yahoo.com/quote/' + ticker + '/history?period1=' + day_one + '&period2=' + day_two + '&interval=1d&filter=history&frequency=1d'
    source = urllib.request.urlopen(url).read()      
    soup = bs.BeautifulSoup(source, 'lxml')
    tr = soup.find_all('tr')
     
    data = []
     
    for table in tr:
        td = table.find_all('td')
        row = [i.text for i in td]
        data.append(row)        
     
    columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
 
    data = data[1:-2]
    df = pd.DataFrame(data)
    df.columns = columns
    df.set_index(columns[0], inplace=True)
    # convert_objects() is deprecated (removed in newer pandas); errors='ignore'
    # leaves non-numeric columns such as the comma-formatted Volume unchanged
    df = df.apply(pd.to_numeric, errors='ignore')
    df = df.iloc[::-1]
    df.dropna(inplace=True)
     
    return df
    

# April 3, 2018 = 1522728000  (seconds since UNIX epoch in 1970)
# June 12, 2018 = 1528776000
# https://finance.yahoo.com/quote/AAPL/history?period1=1522728000&period2=1528776000&interval=1d&filter=history&frequency=1d


format_string='%Y-%m-%d %H:%M:%S'

# One day (86400 second) adjustment required to get dates printed to match web site manual output
date1='2018-04-03 00:00:00'
date1_epoch = str(int(time.mktime(time.strptime(date1,format_string)))- 86400)
print("")
print(date1, date1_epoch)

date2='2018-06-12 00:00:00'
date2_epoch = str(int(time.mktime(time.strptime(date2,format_string))))
print(date2, date2_epoch)

df = get_ticker('AAPL', date1_epoch, date2_epoch)
print(df)    
Abridged output:
Output:
2018-04-03 00:00:00 1522728000
2018-06-12 00:00:00 1528776000
              Open    High    Low     Close   Adj Close  Volume
Date
Apr 03, 2018  167.64  168.75  164.88  168.39  167.74     30,278,000
Apr 04, 2018  164.88  172.01  164.77  171.61  170.95     34,605,500
Apr 05, 2018  172.58  174.23  172.08  172.80  172.14     26,933,200
...           (intermediate data deleted by Lewis to save space)
Jun 11, 2018  191.35  191.97  190.21  191.23  191.23     18,308,500
Jun 12, 2018  191.39  192.61  191.15  192.28  192.28     16,911,100
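The one-day (86,400-second) fudge above comes from `time.mktime` interpreting the date in local time while Yahoo's `period1`/`period2` are plain UNIX timestamps. A sketch of computing the epoch seconds for midnight UTC instead, which may remove the need for the manual adjustment (assumption: Yahoo treats the parameters as UTC):

```python
import datetime

def to_epoch_utc(d):
    # Seconds since the UNIX epoch for midnight UTC on date d,
    # independent of the machine's local timezone.
    dt = datetime.datetime(d.year, d.month, d.day, tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())

print(to_epoch_utc(datetime.date(2018, 4, 3)))  # 1522713600
```

Note this differs from the thread's 1522728000 by four hours, which is consistent with that value being midnight US/Eastern rather than midnight UTC.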
Lewis
To paraphrase: 'Throw out your dead' code. https://www.youtube.com/watch?v=grbSQ6O6kbs Forward to 1:00
#3
(Jun-17-2018, 11:58 PM)ljmetzger Wrote: Please don't try to add bold or color inside the Python tags. It mucks everything up when others are trying to run the code.

Hi, yes, I realized that afterwards, but I couldn't edit my post anymore.

So do you think there is no way to get any older data? If you could somehow specify the period you want the data for, I think it should be possible, but I'm not sure how.
#4
You can get more data, but it seems you have to limit each request to chunks of 60 days or less.

The following example is the equivalent of (January 3, 2018 thru June 12, 2018) the manual URL of: https://finance.yahoo.com/quote/AAPL/his...equency=1d

Significant points:
a. The package monthdelta is required (which you probably do not have installed). To install it, run pip install monthdelta from the command line (cmd.exe on Windows, or the Linux equivalent).
b. The code is similar to the code that I previously posted. Epoch (seconds) calculation was moved into function get_ticker().
c. The following code snippet was used to iterate through the dates (maximum of two months at a time) and also to concatenate the data frame from get_ticker() into one large dataframe:
iteration_number = 0
while date1 <= end_date:
    iteration_number += 1

    # Create 'date2' in a 60 day Window or less
    # Start 'date2' two months from 'date1'
    # Change the 'day of the month' to the 1st day of the month
    # Subtract 'one day' to change the 1st day of the month, into the last day of the previous month
    date2 = date1 + monthdelta.monthdelta(2)
    date2 = datetime.date(date2.year, date2.month, 1)
    date2 = date2 - datetime.timedelta(days=1)
        
    # Do not allow 'date2' to go beyond the 'End Date'
    if date2 > end_date:
        date2 = end_date
        
    print("Processing {} thru {}.".format(date1, date2))
    stock_symbol = 'AAPL'
    df = get_ticker(stock_symbol, date1, date2)
    
    if iteration_number == 1:
        dfall = df.copy()
    else:
        frames = [dfall, df]
        dfall = pd.concat(frames)

    # # # print(dfall)
    # # # print("len of dfall = {}".format(len(dfall)))

    # Increment the first date for the next pass
    date1 = date1   + monthdelta.monthdelta(2)
    date1 = datetime.date(date1.year, date1.month, 1)
import bs4 as bs
import urllib.request
import pandas as pd
import time
import datetime
import monthdelta
 
def get_ticker(ticker, date1, date2):

    format_string='%Y-%m-%d %H:%M:%S'

    # One day (86400 second) adjustment required to get dates printed to match web site manual output
    _date1 = date1.strftime("%Y-%m-%d 00:00:00")
    date1_epoch = str(int(time.mktime(time.strptime(_date1, format_string)))- 86400)
    print("")
    print(date1, date1_epoch, " + 86,400 = ", str(int(date1_epoch) + 86400))

    _date2 = date2.strftime("%Y-%m-%d 00:00:00")
    date2_epoch = str(int(time.mktime(time.strptime(_date2, format_string))))
    print(date2, date2_epoch)

    url = 'https://finance.yahoo.com/quote/' + ticker + '/history?period1=' + date1_epoch + '&period2=' + date2_epoch + '&interval=1d&filter=history&frequency=1d'
    source = urllib.request.urlopen(url).read()      
    soup = bs.BeautifulSoup(source, 'lxml')
    tr = soup.find_all('tr')
     
    data = []
     
    for table in tr:
        td = table.find_all('td')
        row = [i.text for i in td]
        data.append(row)        
     
    columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
 
    data = data[1:-2]
    df = pd.DataFrame(data)
    df.columns = columns
    df.set_index(columns[0], inplace=True)
    # convert_objects() is deprecated (removed in newer pandas); errors='ignore'
    # leaves non-numeric columns such as the comma-formatted Volume unchanged
    df = df.apply(pd.to_numeric, errors='ignore')
    df = df.iloc[::-1]
    df.dropna(inplace=True)
     
    return df


# January 3, 2018 = 1514955600  (seconds since UNIX epoch in 1970)
# June   12, 2018 = 1528776000
# https://finance.yahoo.com/quote/AAPL/history?period1=1514955600&period2=1528776000&interval=1d&filter=history&frequency=1d

print("")
print("")
start_date = datetime.date(2018, 1, 3)
end_date = datetime.date(2018, 6, 12)
today = datetime.date.today()

# The statements in this group are for debugging purposes only
format_string='%Y-%m-%d %H:%M:%S'
t1 = start_date.strftime("%Y-%m-%d 00:00:00")
t2 = end_date.strftime("%Y-%m-%d 00:00:00")
start_date_epoch = str(int(time.mktime(time.strptime(t1, format_string))))
end_date_epoch = str(int(time.mktime(time.strptime(t2,format_string))))


# Output all 'original' dates
print('Today     :', today)
print('Start Date:', start_date, 'Start Date Epoch:', start_date_epoch)
print('End   Date:', end_date,   'End   Date Epoch:', end_date_epoch)

# Initialize 'date1'
date1 = start_date

# Do not allow the 'End Date' to be AFTER today
if today < end_date:
  end_date = today

iteration_number = 0
while date1 <= end_date:
    iteration_number += 1

    # Create 'date2' in a 60 day Window or less
    date2 = date1 + monthdelta.monthdelta(2)
    date2 = datetime.date(date2.year, date2.month, 1)
    date2 = date2 - datetime.timedelta(days=1)
        
    # Do not allow 'date2' to go beyond the 'End Date'
    if date2 > end_date:
        date2 = end_date
        
    print("Processing {} thru {}.".format(date1, date2))
    stock_symbol = 'AAPL'
    df = get_ticker(stock_symbol, date1, date2)
    
    if iteration_number == 1:
        dfall = df.copy()
    else:
        frames = [dfall, df]
        dfall = pd.concat(frames)

    # # # print(dfall)
    # # # print("len of dfall = {}".format(len(dfall)))

    # Increment the first date for the next pass
    date1 = date1   + monthdelta.monthdelta(2)
    date1 = datetime.date(date1.year, date1.month, 1)

print(dfall)
print("len of dfall = {}".format(len(dfall)))
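If monthdelta cannot be installed, the same "first of the month, two months ahead" arithmetic can be done with the standard library alone; a minimal sketch (`add_months` is a hypothetical helper, not part of the posted code):

```python
import datetime

def add_months(d, months):
    # Advance d by a whole number of months, clamping to the 1st of the
    # resulting month (which is all the windowing loop above needs).
    total = d.month - 1 + months
    return datetime.date(d.year + total // 12, total % 12 + 1, 1)

date1 = datetime.date(2018, 1, 3)
date2 = add_months(date1, 2) - datetime.timedelta(days=1)  # last day of the window
print(date1, date2)  # 2018-01-03 2018-02-28
```

Unlike a flat `timedelta(days=60)`, this rolls over year boundaries correctly (e.g. December plus two months lands in February of the next year).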
Lewis
#5
(Jun-18-2018, 08:24 PM)ljmetzger Wrote: You can get more data, but it seems like you have to limit the amount of data at one time to 60 day chunks. [...]

That's awesome! Thanks so much for spending time on this. The only thing I changed for now is that I omitted the monthdelta library: I'm using Anaconda and tried to conda install it, but that didn't work. Instead I used datetime.timedelta, which seems to do the trick. Below is the full code with my changes, FYI.

import bs4 as bs
import urllib.request
import pandas as pd
import time
import datetime

  
def get_ticker(ticker, date1, date2):
 
    format_string='%Y-%m-%d %H:%M:%S'
 
    # One day (86400 second) adjustment required to get dates printed to match web site manual output
    _date1 = date1.strftime("%Y-%m-%d 00:00:00")
    date1_epoch = str(int(time.mktime(time.strptime(_date1, format_string)))- 86400)
    print("")
    print(date1, date1_epoch, " + 86,400 = ", str(int(date1_epoch) + 86400))
 
    _date2 = date2.strftime("%Y-%m-%d 00:00:00")
    date2_epoch = str(int(time.mktime(time.strptime(_date2, format_string))))
    print(date2, date2_epoch)
 
    url = 'https://finance.yahoo.com/quote/' + ticker + '/history?period1=' + date1_epoch + '&period2=' + date2_epoch + '&interval=1d&filter=history&frequency=1d'
    source = urllib.request.urlopen(url).read()      
    soup = bs.BeautifulSoup(source, 'lxml')
    tr = soup.find_all('tr')
      
    data = []
      
    for table in tr:
        td = table.find_all('td')
        row = [i.text for i in td]
        data.append(row)        
      
    columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
  
    data = data[1:-2]
    df = pd.DataFrame(data)
    df.columns = columns
    df.set_index(columns[0], inplace=True)
    # convert_objects() is deprecated (removed in newer pandas); errors='ignore'
    # leaves non-numeric columns such as the comma-formatted Volume unchanged
    df = df.apply(pd.to_numeric, errors='ignore')
    df = df.iloc[::-1]
    df.dropna(inplace=True)
      
    return df
 
 
# January 3, 2018 = 1514955600  (seconds since UNIX epoch in 1970)
# June   12, 2018 = 1528776000
# https://finance.yahoo.com/quote/AAPL/history?period1=1514955600&period2=1528776000&interval=1d&filter=history&frequency=1d
 
print("")
print("")
start_date = datetime.date(2005, 1, 3)
end_date = datetime.date(2018, 6, 12)
today = datetime.date.today()
 
# The statements in this group are for debugging purposes only
format_string='%Y-%m-%d %H:%M:%S'
t1 = start_date.strftime("%Y-%m-%d 00:00:00")
t2 = end_date.strftime("%Y-%m-%d 00:00:00")
start_date_epoch = str(int(time.mktime(time.strptime(t1, format_string))))
end_date_epoch = str(int(time.mktime(time.strptime(t2,format_string))))
 
 
# Output all 'original' dates
print('Today     :', today)
print('Start Date:', start_date, 'Start Date Epoch:', start_date_epoch)
print('End   Date:', end_date,   'End   Date Epoch:', end_date_epoch)
 
# Initialize 'date1'
date1 = start_date
 
# Do not allow the 'End Date' to be AFTER today
if today < end_date:
  end_date = today
 
iteration_number = 0
while date1 <= end_date:
    iteration_number += 1
 
    # Create 'date2' in a 60 day Window or less
    date2 = date1 + datetime.timedelta(days=60)
    date2 = datetime.date(date2.year, date2.month, 1)
    date2 = date2 - datetime.timedelta(days=1)
         
    # Do not allow 'date2' to go beyond the 'End Date'
    if date2 > end_date:
        date2 = end_date
         
    print("Processing {} thru {}.".format(date1, date2))
    stock_symbol = 'AAPL'
    df = get_ticker(stock_symbol, date1, date2)
     
    if iteration_number == 1:
        dfall = df.copy()
    else:
        frames = [dfall, df]
        dfall = pd.concat(frames)
 
    # # # print(dfall)
    # # # print("len of dfall = {}".format(len(dfall)))
 
    # Increment the first date for the next pass
    date1 = date1   + datetime.timedelta(days=60)
    date1 = datetime.date(date1.year, date1.month, 1)
 
print(dfall)
print("len of dfall = {}".format(len(dfall)))
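The `iteration_number` bookkeeping around `pd.concat` can be simplified by collecting each chunk in a list and concatenating once after the loop; a minimal sketch with toy frames standing in for the scraped per-window DataFrames:

```python
import pandas as pd

frames = []
for chunk in ([168.39, 171.61], [191.23, 192.28]):  # stand-ins for scraped chunks
    frames.append(pd.DataFrame({'Close': chunk}))

# A single concat after the loop replaces the first-iteration special case
dfall = pd.concat(frames, ignore_index=True)
print(len(dfall))  # 4
```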

