Webscraping with beautifulsoup

cormanstan · Aug-23-2023, 12:04 AM

hey everyone,

I've been working on a project and it's not working as intended. I hope someone here can help. I have a basic understanding of Python and would really appreciate any help.

The project uses Python and yfinance to extract some stock data, then scrapes a website to extract Tesla's quarterly revenue from a table. The problem is in part 2, in two places: when I download the URL as text to be parsed by BeautifulSoup, and when I try to remove the commas and dollar signs.

When I print(soup) I get a 403 error. It appears I'm just being blocked by the website, but it seemed to have worked before. Am I wrong? Is there another way to scrape the website without getting the error?

Install the packages

!pip install yfinance

!pip install bs4

Imported the libraries

import yfinance as yf

import pandas as pd

import requests

from bs4 import BeautifulSoup

import plotly.graph_objects as go
from plotly.subplots import make_subplots

Part 2: Web scraping to extract Tesla revenue

Define the url and download the text file

url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text

Parse the html data using beautifulsoup

soup = BeautifulSoup(html_data)

print(soup)

Error:
<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>403 Forbidden</title>
</head>
<body>
<h1>Error 403 Forbidden</h1>
<p>Forbidden</p>
<h3>Error 54113</h3>
<p>Details: cache-nrt-rjtf7700062-NRT 1692742966 2184915405</p>
<hr/>
<p>Varnish cache server</p>
</body>
</html>

Then I try looking for the table entitled "Tesla Quarterly Revenue" with two columns for date and price.

data = []
for table in soup.find_all("table"):
    
    if any(["Tesla Quarterly Revenue".lower() in th.text.lower() for th in table.find_all("th")]):
        for row in table.find("tbody").find_all("tr"):
            date_col, rev_col = [col for col in row.find_all("td")]
            data.append({
                "Date": date_col.text,
                "Revenue": rev_col.text
            })

tesla_revenue = pd.DataFrame(data)

Remove the comma and dollar sign

tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(',|\$',"")

I get the following error.

Error:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File ~\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Revenue'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[12], line 1
----> 1 tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(',|\$',"")

File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 'Revenue'

It just appears that I'm being blocked by the website so no data is being passed along. Is this correct? Any suggestions?
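For what it's worth, here is a minimal sketch of why the KeyError follows from the 403 page. It assumes the blocked response contains no <table> at all, so the scraping loop never appends a row and the DataFrame is built from an empty list. The guard at the end is only a suggestion, not part of the original code.

```python
import pandas as pd

# Assumption: the 403 page has no <table>, so the scraping loop
# appends nothing and `data` is still an empty list.
data = []
tesla_revenue = pd.DataFrame(data)

# An empty DataFrame built this way has no rows and no columns,
# which is why tesla_revenue["Revenue"] raises KeyError.
print(tesla_revenue.empty)
print(list(tesla_revenue.columns))

# Suggested guard: check for the column before using it.
if "Revenue" not in tesla_revenue.columns:
    print("No revenue table was scraped; check the HTTP response first.")
```

So the KeyError is a symptom of the blocked request rather than a separate bug in the cleaning line.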

***snippsat*** · Aug-23-2023, 07:09 AM

(Aug-23-2023, 12:04 AM)cormanstan Wrote: It just appears that I'm being blocked by the website so no data is being passed along. Is this correct? Any suggestions?

Set user agent then it will work.

import requests
from bs4 import BeautifulSoup

url = 'https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('h2').text)

Output:
Tesla Revenue 2010-2023 | TSLA
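Once the page downloads, the table loop from the question can be wrapped up roughly as below. This is only a sketch: extract_revenue_table is a hypothetical helper name, SAMPLE_HTML is a made-up stand-in for the real page (the figures are placeholders, not real revenue), and it uses the built-in "html.parser" so it runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the figures are made up.
SAMPLE_HTML = """
<table>
  <thead><tr><th colspan="2">Tesla Quarterly Revenue (Millions of US $)</th></tr></thead>
  <tbody>
    <tr><td>2000-06-30</td><td>$1,234</td></tr>
    <tr><td>2000-03-31</td><td>$987</td></tr>
  </tbody>
</table>
"""


def extract_revenue_table(html, title):
    """Return the Date/Revenue rows of the table whose header contains `title`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        headers = [th.get_text(strip=True).lower() for th in table.find_all("th")]
        if not any(title.lower() in header for header in headers):
            continue
        body = table.find("tbody") or table
        for tr in body.find_all("tr"):
            cells = tr.find_all("td")
            # Skip header or malformed rows instead of failing on unpacking.
            if len(cells) == 2:
                rows.append({
                    "Date": cells[0].get_text(strip=True),
                    "Revenue": cells[1].get_text(strip=True),
                })
    # Naming the columns keeps "Revenue" present even when nothing matched.
    return pd.DataFrame(rows, columns=["Date", "Revenue"])


tesla_revenue = extract_revenue_table(SAMPLE_HTML, "Tesla Quarterly Revenue")
print(tesla_revenue)
```

With the real page you would pass response.text (or response.content) instead of SAMPLE_HTML. Because the columns are named explicitly, a blocked or changed page gives an empty frame rather than a KeyError later on.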

cormanstan · Aug-24-2023, 12:02 AM

Thanks for the reply and it appears to have worked. May I ask why adding the user agent was so important? I have another question if you don't mind.

I'm trying to use the make_graph function on several datasets and keep getting the same error.

make_graph(gme_data, gme_revenue, 'GameStop')

Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[140], line 1
----> 1 make_graph(gme_data, gme_revenue, 'GameStop')

Cell In[120], line 6, in make_graph(stock_data, revenue_data, stock)
      4 revenue_data_specific = revenue_data[revenue_data.Date <= '2021-04-30']
      5 fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_format=True), y=stock_data_specific.Close.astype("float"), name="Share Price"), row=1, col=1)
----> 6 fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_format=True), y=revenue_data_specific.Revenue.astype("float"), name="Revenue"), row=2, col=1)
      7 fig.update_xaxes(title_text="Date", row=1, col=1)
      8 fig.update_xaxes(title_text="Date", row=2, col=1)

File ~\anaconda3\Lib\site-packages\pandas\core\generic.py:6240, in NDFrame.astype(self, dtype, copy, errors)
   6233     results = [
   6234         self.iloc[:, i].astype(dtype, copy=copy)
   6235         for i in range(len(self.columns))
   6236     ]
   6238 else:
   6239     # else, only a single dtype is given
-> 6240     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6241     return self._constructor(new_data).__finalize__(self, method="astype")
   6243 # GH 33113: handle empty frame or series

File ~\anaconda3\Lib\site-packages\pandas\core\internals\managers.py:448, in BaseBlockManager.astype(self, dtype, copy, errors)
    447 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 448     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~\anaconda3\Lib\site-packages\pandas\core\internals\managers.py:352, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    350         applied = b.apply(f, **kwargs)
    351     else:
--> 352         applied = getattr(b, f)(**kwargs)
    353 except (TypeError, NotImplementedError):
    354     if not ignore_failures:

File ~\anaconda3\Lib\site-packages\pandas\core\internals\blocks.py:526, in Block.astype(self, dtype, copy, errors)
    508 """
    509 Coerce to the new dtype.
    510 
   (...)
    522 Block
    523 """
    524 values = self.values
--> 526 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    528 new_values = maybe_coerce_values(new_values)
    529 newb = self.make_block(new_values)

File ~\anaconda3\Lib\site-packages\pandas\core\dtypes\astype.py:299, in astype_array_safe(values, dtype, copy, errors)
    296     return values.copy()
    298 try:
--> 299     new_values = astype_array(values, dtype, copy=copy)
    300 except (ValueError, TypeError):
    301     # e.g. astype_nansafe can fail on object-dtype of strings
    302     #  trying to convert to float
    303     if errors == "ignore":

File ~\anaconda3\Lib\site-packages\pandas\core\dtypes\astype.py:230, in astype_array(values, dtype, copy)
    227     values = values.astype(dtype, copy=copy)
    229 else:
--> 230     values = astype_nansafe(values, dtype, copy=copy)
    232 # in pandas we don't store numpy str dtypes, so convert to object
    233 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~\anaconda3\Lib\site-packages\pandas\core\dtypes\astype.py:170, in astype_nansafe(arr, dtype, copy, skipna)
    166     raise ValueError(msg)
    168 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
    169     # Explicit copy, or required since NumPy can't view from / to object.
--> 170     return arr.astype(dtype, copy=True)
    172 return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: '$10,389'

I originally defined the make_graph function as the following:

def make_graph(stock_data, revenue_data, stock):
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical Share Price", "Historical Revenue"), vertical_spacing = .3)
    stock_data_specific = stock_data[stock_data.Date <= '2021-06-14']
    revenue_data_specific = revenue_data[revenue_data.Date <= '2021-04-30']
    fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_format=True), y=stock_data_specific.Close.astype("float"), name="Share Price"), row=1, col=1)
    fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_format=True), y=revenue_data_specific.Revenue.astype("float"), name="Revenue"), row=2, col=1)
    fig.update_xaxes(title_text="Date", row=1, col=1)
    fig.update_xaxes(title_text="Date", row=2, col=1)
    fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
    fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
    fig.update_layout(showlegend=False,
    height=900,
    title=stock,
    xaxis_rangeslider_visible=True)
    fig.show()

I even tried adding a preprocessing line to the function above, but I kept getting the same error.

# Preprocess the revenue data
    revenue_data_specific["Revenue"] = revenue_data_specific["Revenue"].str.replace(r',|\$', "").astype(float)

I had the same error when calling make_graph on the Tesla data and solved it with the line below. I tried a similar piece of code for gme_data and gme_revenue, but it didn't help.

tesla_revenue["Revenue"] = tesla_revenue['Revenue'].str.replace(',|\$',"")

Any suggestions how I can fix the error and call the make_graph function on gme_data and gme_revenue?

***snippsat*** · (This post was last modified: Aug-24-2023, 11:57 AM by snippsat.)

(Aug-24-2023, 12:02 AM)cormanstan Wrote: May I ask why adding the user agent was so important?

The User-Agent header tells the site what kind of client is making the request. Without a browser-like one, the website may block your requests because it can tell you aren't a real user but a web scraper or bot.
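To see what the site sees, you can print the User-Agent that Requests sends by default. A small sketch, with no network access needed: BROWSER_UA is just the string from the earlier post, and fetch_html is a hypothetical helper that is defined here but not called.

```python
import requests

# What Requests announces itself as when no headers are passed.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # something like python-requests/<version>

# Browser-like User-Agent string taken from the earlier post.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
)

# A session sends the same headers with every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})
print(session.headers["User-Agent"])


def fetch_html(url):
    """Hypothetical helper: download a page and fail loudly on 403 and similar."""
    response = session.get(url, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

Calling raise_for_status() means a blocked request stops at the download step instead of showing up later as an empty table.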

Quote:Any suggestions how I can fix the error and call the make_graph function on gme_data and gme_revenue?

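Here is a test; done like this it should work. Note that you must set regex=True when the pattern passed to str.replace is a regular expression rather than a single literal string (in pandas 2.0 and later the default is regex=False, so the pattern is otherwise treated literally).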

import pandas as pd

data = {
    'Date': ['2021/11/24', '2021/11/25', '2021/11/26', '2021/11/27'],
    'Price': ['$10,389', '$10,450', '$10,520', '$10,600'],
}

df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
df['Price'] = df['Price'].str.replace(r'\$|,', '', regex=True).astype(int)

>>> df
        Date  Price
0 2021-11-24  10389
1 2021-11-25  10450
2 2021-11-26  10520
3 2021-11-27  10600

# Works; the column is now an integer type (int32 here)
>>> df['Price'].max()
10600

>>> df.dtypes
Date     datetime64[ns]
Price             int32
dtype: object
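Applied to the make_graph problem, one option is to clean the revenue frame before passing it in, rather than inside the function on a filtered slice. The sketch below assumes gme_revenue has string "Date" and "Revenue" columns like the Tesla frame. clean_revenue is a hypothetical helper name, and apart from '$10,389' (taken from the error message) the sample rows are made up.

```python
import pandas as pd


def clean_revenue(revenue_data):
    """Return a copy with Revenue as floats and unparseable rows dropped."""
    cleaned = revenue_data.copy()  # leave the caller's frame untouched
    # regex=True so "[$,]" is treated as a character class, not literal text.
    cleaned["Revenue"] = cleaned["Revenue"].str.replace(r"[$,]", "", regex=True)
    # Empty or non-numeric cells become NaN instead of raising ValueError.
    cleaned["Revenue"] = pd.to_numeric(cleaned["Revenue"], errors="coerce")
    return cleaned.dropna(subset=["Revenue"])


# Made-up sample rows; only '$10,389' comes from the traceback above.
gme_revenue = pd.DataFrame({
    "Date": ["2000-04-30", "2000-01-31", "1999-10-31"],
    "Revenue": ["$10,389", "$1,021", ""],
})

gme_revenue_clean = clean_revenue(gme_revenue)
print(gme_revenue_clean)
print(gme_revenue_clean.dtypes)
```

After that, make_graph(gme_data, gme_revenue_clean, 'GameStop') should get past the astype("float") call, since the column no longer contains '$' or ','.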
