html_table_parser_python3 KeyError odd behavior

idratherbecoding · (This post was last modified: Apr-13-2023, 05:38 AM by buran.)

Hello,

I am new to Python, but I have been hobby coding off and on for many years. I am working on a project to scrape sports data (NFL) and I am running into an issue while using the package html_table_parser_python3. The table on the page I am scraping has 33 rows (according to the shape[0] of my pandas DataFrame object). I am trying to access information on the 30th row (index 29), but I am getting a KeyError thrown saying 29 is not in range. I have provided the relevant code snippet below along with the error message. I am not sure if the issue is with html_parser, pandas, or something else. I appreciate any help. Thanks.

def get_rb_data(home_rbs, away_rbs, home, away):
    rb_html = get_table_from_url('https://www.teamrankings.com/nfl/stat/rushing-attempts-per-game').decode('utf-8')
    parsed_rb_html = HTMLTableParser()
    parsed_rb_html.feed(rb_html)
    rb_data_frame = pd.DataFrame(parsed_rb_html.tables[0])

    home_rushes = 0

    away_rushes = 0

    print(rb_data_frame.shape[0]) #This is for troubleshooting to see how many rows are in the table. Prints 33.

    for x in range(rb_data_frame.shape[0]):
        if rb_data_frame.loc[x][1] == home:
            home_rushes = float(rb_data_frame[x][2]) * 17

    for x in range(rb_data_frame.shape[0]):
        if rb_data_frame.loc[x][1] == away:
            away_rushes = float(rb_data_frame[x][2]) * 17

if __name__ == '__main__':
    texans_rbs = ['Dameon Pierce', 'Rex Burkhead']
    broncos_rbs = ['Latavius Murray', 'Chase Edmonds']

    #Error is thrown on this line. See below for full traceback.
    texans_rb_attributes, broncos_rb_attributes = get_rb_data(texans_rbs, broncos_rbs, 'Houston', 'Denver')

Error:
Traceback (most recent call last):
  File "/Users/aaronlott/PycharmProjects/Scraping/venv/lib/python3.9/site-packages/pandas/core/indexes/range.py", line 345, in get_loc
    return self._range.index(new_key)
ValueError: 29 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/aaronlott/PycharmProjects/Scraping/main.py", line 326, in <module>
    texans_rb_butes, broncos_rb_butes = get_rb_data(texans_rbs, broncos_rbs, 'Houston', 'Denver')
  File "/Users/aaronlott/PycharmProjects/Scraping/main.py", line 94, in get_rb_data
    home_rushes = float(rb_data_frame[x][2]) * 17
  File "/Users/aaronlott/PycharmProjects/Scraping/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 3760, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/aaronlott/PycharmProjects/Scraping/venv/lib/python3.9/site-packages/pandas/core/indexes/range.py", line 347, in get_loc
    raise KeyError(key) from err
KeyError: 29


idratherbecoding · Apr-13-2023, 12:53 PM

I’ve been doing some searching on my own since I posted. I think the problem is that I was not indexing properly and using loc when I should be using iloc.

Instead of df.iloc[row, col], which appears to be what I am trying to do, I was using df.loc[row][col], which uses labels rather than integer indices, hence the error. I will have to wait until I get home from work to verify this solves my problem, but if anyone wants to confirm for me before then, that would be appreciated. Thanks.
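
To illustrate the difference, here is a minimal sketch on a small made-up frame (the rows are placeholders, not the scraped table). With default integer labels, `df[x]` selects a column by label, so `df[x][2]` reads column `x`, row 2. That would explain why a row number such as 29 raises `KeyError` on a frame that has only a handful of columns, while `df.iloc[row, col]` takes both positions in one call.

```python
import pandas as pd

# Placeholder data: a list of rows, assumed to resemble the parser's table output.
rows = [
    ["1", "Philadelphia", "33.2"],
    ["2", "Atlanta", "32.9"],
    ["3", "Chicago", "32.8"],
    ["4", "Washington", "31.6"],
]
df = pd.DataFrame(rows)  # columns are labelled 0, 1, 2; rows are labelled 0..3

row = 3

# df[key] looks up a COLUMN label, so df[3] fails: there is no column 3.
try:
    df[row][2]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# iloc takes (row position, column position) in a single call.
team = df.iloc[row, 1]
rushes = float(df.iloc[row, 2])

print(column_lookup_failed, team, rushes)
```

In the loop from the first post, the failing line only runs once the team name matches, which is presumably why the error shows up at row 29 rather than at the first row number that exceeds the column count.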

***snippsat*** · Apr-13-2023, 03:07 PM

I see some problems here, so here are a few tips.
Pandas can parse the HTML table directly:

import pandas as pd

df = pd.read_html('https://www.teamrankings.com/nfl/stat/rushing-attempts-per-game')[0]

>>> df.head()
   Rank          Team  2022  Last 3  Last 1  Home  Away  2021
0     1  Philadelphia  33.2    40.0    32.0  33.9  32.3  31.5
1     2       Atlanta  32.9    34.0    35.0  35.0  30.5  23.1
2     3       Chicago  32.8    24.3    22.0  32.8  32.9  27.9
3     4    Washington  31.6    37.0    41.0  30.7  32.8  28.1
4     5     Cleveland  31.3    28.7    22.0  33.6  29.2  28.5

# Always check the dtypes
>>> df.dtypes
Rank        int64
Team       object
2022      float64
Last 3    float64
Last 1    float64
Home      float64
Away      float64
2021      float64
dtype: object

So the table and types look OK.

When you write a regular Python loop over a DataFrame like you do, it is almost guaranteed to be the wrong approach in Pandas.
To give an example similar to what you are trying to do: let's say that where Last 1 has a value over 30, we multiply Home by 17.

import pandas as pd

df = pd.read_html('https://www.teamrankings.com/nfl/stat/rushing-attempts-per-game')[0]
mask = df["Last 1"] > 30
df.loc[mask, "Home"] = df.loc[mask, "Home"] * 17

>>> df.head(8)
   Rank          Team  2022  Last 3  Last 1   Home  Away  2021
0     1  Philadelphia  33.2    40.0    32.0  576.3  32.3  31.5
1     2       Atlanta  32.9    34.0    35.0  595.0  30.5  23.1
2     3       Chicago  32.8    24.3    22.0   32.8  32.9  27.9
3     4    Washington  31.6    37.0    41.0  521.9  32.8  28.1
4     5     Cleveland  31.3    28.7    22.0   33.6  29.2  28.5
5     6     Baltimore  31.2    30.0    35.0  532.1  31.1  30.4
6     7        Dallas  30.9    28.0    22.0   30.0  31.8  27.4
7     8     NY Giants  30.0    23.7    20.0   33.0  27.3  24.6

As you can see, there is no loop: the operation works on the whole DataFrame using built-in Pandas functionality, which is also a lot faster.
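
The same masking idea can be applied to the lookup in the original question. The following is only a sketch under assumptions: `season_rushes` is a hypothetical helper name, the frame is a three-row placeholder that reuses the `Team` and `2022` column names from the output above, and the factor of 17 is taken from the code in the first post.

```python
import pandas as pd

# Placeholder frame using the column names shown above; not the full scraped table.
df = pd.DataFrame(
    {
        "Team": ["Philadelphia", "Atlanta", "Chicago"],
        "2022": [33.2, 32.9, 32.8],
    }
)


def season_rushes(frame, team, games=17):
    """Hypothetical helper: per-game rushing attempts for `team`, scaled to `games`.

    Returns 0.0 if the team is not in the frame.
    """
    # Boolean mask selects the matching row(s); no explicit loop over rows.
    values = frame.loc[frame["Team"] == team, "2022"]
    if values.empty:
        return 0.0
    return float(values.iloc[0]) * games


home_rushes = season_rushes(df, "Atlanta")
away_rushes = season_rushes(df, "Chicago")
print(home_rushes, away_rushes)
```

Selecting by column name rather than by position also keeps the code working if the site reorders its columns.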

idratherbecoding · Apr-14-2023, 04:27 PM

Thank you so much for taking the time to write such a detailed response. This really helped me clean up my code quite a bit. You are right, that is much easier and faster. Thanks!
