Python Forum
Webscraping with beautifulsoup
#1
Hey everyone,

I've been working on a project and it's not working as intended. I have a basic understanding of Python and would really appreciate any help.

The project uses Python and yfinance to extract some stock data, and web scraping to extract Tesla quarterly revenue from a table on a website. The problem is in part 2, in two places: when I download the URL as text to be parsed by BeautifulSoup, and when I try to remove the commas and dollar signs.

When I print(soup) I get a 403 error. It appears I'm being blocked by the website, but it seemed to have worked before. Am I wrong? Is there another way to scrape the website without getting the error?

Install the packages
!pip install yfinance
!pip install bs4
Import the libraries
import yfinance as yf
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots
Part 2: Web scraping to extract Tesla revenue

Define the URL and download the page as text
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
Parse the HTML data using BeautifulSoup
soup = BeautifulSoup(html_data, "html.parser")
print(soup)

Error:
<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html> <head> <title>403 Forbidden</title> </head> <body> <h1>Error 403 Forbidden</h1> <p>Forbidden</p> <h3>Error 54113</h3> <p>Details: cache-nrt-rjtf7700062-NRT 1692742966 2184915405</p> <hr/> <p>Varnish cache server</p> </body> </html>
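One way to make the failure visible earlier is to check the HTTP status before handing the text to BeautifulSoup. The sketch below is only an illustration, not a guaranteed fix: the `fetch_html` helper and the User-Agent string are assumptions, and the site may still refuse the request or disallow scraping in its terms of use.

```python
import requests

URL = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"

# Assumption: an example browser-like User-Agent string. Some servers reject
# the default "python-requests/x.y" agent; this header may or may not help.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/115.0 Safari/537.36"
    )
}


def fetch_html(url, headers=None, timeout=10):
    """Hypothetical helper: return the page text, or raise on an HTTP error."""
    response = requests.get(url, headers=headers or HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses, so a 403 error page
    # never reaches the parser.
    response.raise_for_status()
    return response.text
```

Calling `html_data = fetch_html(URL)` inside a `try`/`except requests.HTTPError` block would then report the 403 directly, instead of letting the error page flow into the table search and produce an empty DataFrame later.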
Then I look for the table titled "Tesla Quarterly Revenue", which has two columns: date and revenue.
data = []
for table in soup.find_all("table"):
    # Only process the table whose header mentions "Tesla Quarterly Revenue"
    if any("tesla quarterly revenue" in th.text.lower() for th in table.find_all("th")):
        for row in table.find("tbody").find_all("tr"):
            # Each data row is expected to hold exactly two cells: date and revenue
            date_col, rev_col = row.find_all("td")
            data.append({
                "Date": date_col.text,
                "Revenue": rev_col.text
            })

tesla_revenue = pd.DataFrame(data)
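The table-matching logic can be checked without touching the website by running it against a small inline HTML sample. The following is a sketch under stated assumptions: `SAMPLE_HTML`, its placeholder values, and the `extract_rows` helper are made up for illustration and do not reflect the real page or real Tesla figures.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with placeholder values (not real data).
SAMPLE_HTML = """
<table>
  <thead><tr><th colspan="2">Tesla Quarterly Revenue (Millions of US $)</th></tr></thead>
  <tbody>
    <tr><td>2000-01-01</td><td>$1,000</td></tr>
    <tr><td>2000-04-01</td><td>$2,500</td></tr>
  </tbody>
</table>
"""


def extract_rows(html, title):
    """Return Date/Revenue dicts from the first-matching tables' two-cell rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True).lower() for th in table.find_all("th")]
        if not any(title.lower() in header for header in headers):
            continue  # not the table we are looking for
        body = table.find("tbody") or table  # fall back if there is no <tbody>
        for tr in body.find_all("tr"):
            cells = tr.find_all("td")
            if len(cells) != 2:
                continue  # skip header rows or malformed rows
            rows.append({
                "Date": cells[0].get_text(strip=True),
                "Revenue": cells[1].get_text(strip=True),
            })
    return rows


rows = extract_rows(SAMPLE_HTML, "Tesla Quarterly Revenue")
print(rows)

# A 403 error page contains no <table>, so nothing is collected.
blocked = extract_rows("<html><body><h1>Error 403 Forbidden</h1></body></html>",
                       "Tesla Quarterly Revenue")
print(blocked)
```

If the sample yields rows but the live page yields none, the problem is most likely in the download step rather than in the parsing loop.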
Remove the commas and dollar signs
tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(',|\$',"")
I get the following error.
Error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Revenue'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[12], line 1
----> 1 tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(',|\$',"")

File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 'Revenue'
It appears that I'm being blocked by the website, so no data is being passed along. Is this correct? Any suggestions?
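The link between the 403 page and the KeyError can be reproduced in isolation: a DataFrame built from an empty list has no columns, so there is no "Revenue" column to look up. The sketch below illustrates this and one way to guard the cleanup step; `clean_revenue` is a hypothetical helper and the sample rows are placeholders, not real Tesla figures.

```python
import pandas as pd


def clean_revenue(df):
    """Hypothetical helper: strip "$" and "," from the Revenue column."""
    if "Revenue" not in df.columns:
        # A DataFrame built from [] has no columns at all, which is what
        # triggers KeyError: 'Revenue' when the scrape returns nothing.
        raise ValueError("No 'Revenue' column: the scrape returned no rows.")
    cleaned = df.copy()
    # regex=True is passed explicitly because the default for
    # Series.str.replace changed to False in pandas 2.0.
    cleaned["Revenue"] = cleaned["Revenue"].str.replace(r"[$,]", "", regex=True)
    return cleaned


# Placeholder rows for illustration only.
sample = pd.DataFrame([
    {"Date": "2000-01-01", "Revenue": "$1,000"},
    {"Date": "2000-04-01", "Revenue": "$2,500"},
])
print(clean_revenue(sample)["Revenue"].tolist())

# What the original loop produces when no table is found.
empty = pd.DataFrame([])
print(list(empty.columns))
```

With a guard like this, a blocked request surfaces as a message about missing rows rather than a long pandas traceback.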


Messages In This Thread
Webscraping with beautifulsoup - by cormanstan - Aug-23-2023, 12:04 AM
RE: Webscraping with beautifulsoup - by snippsat - Aug-23-2023, 07:09 AM
RE: Webscraping with beautifulsoup - by cormanstan - Aug-24-2023, 12:02 AM
RE: Webscraping with beautifulsoup - by snippsat - Aug-24-2023, 11:57 AM
