Thread: Inconsistent behaviour in output - web scraping (/thread-34945.html)
Python Forum (https://python-forum.io), Forum: General Coding Help (https://python-forum.io/forum-8.html)

Inconsistent behaviour in output - web scraping - Steve - Sep-18-2021

Hi all,

I'm not an expert at coding in Python, but I have used it a few times to write things like the following example, which is meant to extract some information from Yahoo Finance by scraping. The code is:

    from bs4 import BeautifulSoup
    import requests as req
    import re
    import urllib3
    import sys

    # Parse input command line parameters.
    if (len(sys.argv) > 1):
        sStockTicker = sys.argv[1]
    else:
        # TODO: Change this to uncomment the exit when debug is finished
        print('Invalid usage. A stock ticker must be passed as parameter.')
        sStockTicker='FMG.AX'
        #sys.exit(2)

    url='https://finance.yahoo.com/quote/' + sStockTicker
    req = urllib3.PoolManager()
    res = req.request("GET", url)
    soup3 = BeautifulSoup(res.data,'lxml')
    print (soup3.find(id="quote-header-info").contents[2].contents[0].contents[0].contents[0].text)

Now, depending on which stock I run this for (the parameter), the code will fetch data that has the same content as the web page I would see by opening the same URL in a web browser such as Chrome or Firefox, and then extract the stock price. It does not always do this, though. If I run the script for "IBM", it works fine. If I run it for "FMG.AX", it works fine. However, if I run it for "IOZ.AX", it fails. If I paste the same URL into the web browser, it loads perfectly fine and shows the expected results.

Hence - my interpretation of the issue - the code is getting a different result back using urllib3 than I get from the browser, and I'm a little baffled as to why. I assume the web server somehow notices that it is being called from Python, based on the values in the request, and then returns a different response.

Does anyone know why this is occurring and how I can work around it? Is it actually caused by what I described above? I am very interested in this intellectually, as well as in resolving the issue in the script.
I find it very odd that a server script would purposefully respond in different ways like this (if that is indeed the cause).

TIA!
Steve.

RE: Inconsistent behaviour in output - web scraping - Larz60+ - Sep-18-2021

Here's one I did a while back: https://python-forum.io/thread-22481.html?highlight=stock

RE: Inconsistent behaviour in output - web scraping - Steve - Sep-19-2021

(Sep-18-2021, 01:30 PM)Larz60+ Wrote: here's one I did a while back: https://python-forum.io/thread-22481.html?highlight=stock

Thanks for that. I had a look at your code, and it's a bit different from the problem that I'm having. I'm quite interested in resolving this issue, and I'm wondering what the cause is. Maybe there's something I need to do when scraping like this that will help me in the future. Thanks for your input though.

RE: Inconsistent behaviour in output - web scraping - snippsat - Sep-19-2021

It's normal to need Selenium for stock sites, as sites like this use a lot of JavaScript. There are libraries like yfinance that still work after the API change from Yahoo. Here is an older thread with a similar task.

RE: Inconsistent behaviour in output - web scraping - Larz60+ - Sep-19-2021

Here's how I would scrape that. The code uses PrettifyPage (included below), which needs to be in the same directory as the main code.

GetQuotes.py

    import requests
    from bs4 import BeautifulSoup
    from PrettifyPage import PrettifyPage
    # TryPaths is not included in this thread; the code below only relies on
    # its textpath attribute, a directory path for output files.
    from TryPaths import TryPaths
    import sys


    class GetQuotes:
        def __init__(self):
            self.tpath = TryPaths()
            self.pp = PrettifyPage().prettify

        def get_quote(self, symbol):
            url = f"https://finance.yahoo.com/quote/{symbol}"
            response = requests.get(url)
            if response.status_code != 200:
                print(f"url {url} not found")
                sys.exit(-1)

            soup = BeautifulSoup(response.text, 'lxml')
            header_info = soup.find('div', id='quote-header-info')

            # Save the prettified header div for inspection
            prettyfile = self.tpath.textpath / "response.html"
            with prettyfile.open('w') as fp:
                fp.write(self.pp(header_info, 2))


    def main(argv):
        if len(argv) > 1:
            symbol = argv[1]
        else:
            symbol = 'FMG.AX'
        gq = GetQuotes()
        gq.get_quote(symbol)


    if __name__ == '__main__':
        main(sys.argv)

PrettifyPage.py

    from bs4 import BeautifulSoup
    import requests
    import pathlib


    class PrettifyPage:
        def __init__(self):
            pass

        def prettify(self, soup, indent):
            pretty_soup = str()
            previous_indent = 0
            for line in soup.prettify().split("\n"):
                # Indent level of this line in BeautifulSoup's own output
                current_indent = str(line).find("<")
                if current_indent == -1 or current_indent > previous_indent + 2:
                    current_indent = previous_indent + 1
                previous_indent = current_indent
                pretty_soup += self.write_new_line(line, current_indent, indent)
            return pretty_soup

        def write_new_line(self, line, current_indent, desired_indent):
            new_line = ""
            spaces_to_add = (current_indent * desired_indent) - current_indent
            if spaces_to_add > 0:
                for i in range(spaces_to_add):
                    new_line += " "
            new_line += str(line) + "\n"
            return new_line


    if __name__ == '__main__':
        pp = PrettifyPage()
        # The original read the file via pp.bpath.htmlpath, which this class
        # never defines; a path relative to the current directory is used here.
        pfilename = pathlib.Path('BusinessEntityRecordsAA.html')
        with pfilename.open('rb') as fp:
            page = fp.read()
        soup = BeautifulSoup(page, 'lxml')
        pretty = pp.prettify(soup, indent=2)
        print(pretty)

Using the default of 'FMG.AX', the div with id "quote-header-info" contains quite a bit of information. What did you want to extract from this? Partial results:

RE: Inconsistent behaviour in output - web scraping - snippsat - Sep-19-2021

(Sep-18-2021, 10:25 AM)Steve Wrote: if I run it for "IOZ.AX" it will fail.

    >>> import yfinance as yf
    >>>
    >>> ioz = yf.Ticker("IOZ.AX")
    >>> ioz.info['symbol']
    'IOZ.AX'
    >>> ioz.info['regularMarketPrice']
    30.71
    >>> ioz.info['regularMarketDayHigh']
    30.9
    >>> ioz.info['currency']
    'AUD'

RE: Inconsistent behaviour in output - web scraping - Larz60+ - Sep-20-2021

Snippsat is correct, I should have used Selenium from the start. Here's a version that will get the header_info that you were looking for. You can use BeautifulSoup to extract details from that.

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time
    import os
    import sys


    class GetQuotes:
        def __init__(self):
            # Make sure path same as script location
            os.chdir(os.path.abspath(os.path.dirname(__file__)))

        def get_quote(self, symbol):
            url = f"https://finance.yahoo.com/quote/{symbol}"
            self.start_browser()
            self.browser.get(url)
            # Give the page's JavaScript time to render before reading it
            time.sleep(4)
            page = self.browser.page_source
            soup = BeautifulSoup(page, 'lxml')
            self.stop_browser()
            header_info = soup.find('div', id='quote-header-info')
            # Extract your info from header_info, example below

        def start_browser(self):
            # The original passed DesiredCapabilities with "marionette" set;
            # Firefox uses Marionette by default, so that is dropped here.
            self.browser = webdriver.Firefox()

        def stop_browser(self):
            self.browser.close()


    def main(argv):
        if len(argv) > 1:
            symbol = argv[1]
        else:
            symbol = 'IOZ.AX'
        gq = GetQuotes()
        gq.get_quote(symbol)


    if __name__ == '__main__':
        main(sys.argv)

This is what's contained in header_info:
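
For anyone landing on this thread later, here is a minimal sketch for checking which explanation applies to a given ticker: whether the server answers a script differently from a browser, or whether the element is simply not in the raw HTML because JavaScript builds it. It uses only the standard library. The names IdProbe, has_element, diagnose and fetch are hypothetical helpers, not part of any post above, and the idea that the User-Agent header changes the response is an assumption to test, not something the thread confirms.

```python
from html.parser import HTMLParser
import urllib.error
import urllib.request


class IdProbe(HTMLParser):
    """Record whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_element(html_text, wanted_id="quote-header-info"):
    """Return True if the raw HTML contains an element with wanted_id."""
    probe = IdProbe(wanted_id)
    probe.feed(html_text)
    return probe.found


def diagnose(status, html_text, wanted_id="quote-header-info"):
    """Describe a response in words instead of failing with a traceback."""
    if status != 200:
        return f"HTTP {status}: server did not return the quote page"
    if not has_element(html_text, wanted_id):
        return f"HTTP 200 but no element with id={wanted_id!r} in the raw HTML"
    return "element present in the raw HTML"


def fetch(url, user_agent=None):
    """Fetch url, optionally sending a User-Agent header (assumption to test)."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; return the status and body instead
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling diagnose(*fetch(url)) for the "IOZ.AX" URL, once without and once with a browser-like User-Agent string, should show which case applies. If the element is missing both times, that points at content built by JavaScript, which is where Selenium or yfinance come in.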