Posts: 8
Threads: 3
Joined: Apr 2023
I am back again with another pandas question. I am getting some unexpected output from df.loc. As part of a larger project, I broke my problem out into a smaller project to exclude any code that is extraneous to the issue I am having. Please see the below code:
import pandas as pd
if __name__ == '__main__':
df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm', skiprows=[0], index_col=0)[0]
print(df.index) #Verifying index has the values I am expecting
print(df.loc['2022', 'Y/P']) #Intended behavior is to get row '2022' and column 'Y/P'
print(df) #Print whole table to compare to output above If you run this code, you'll see that the DataFrame is wide enough that pandas cuts out the relevant column, so if you are interested, you can go look at the url from the code to see the table. Aside from printing the index column and the whole DataFrame, which both output as expected, the output I am expecting is 47.7, the value in the referenced cell. The output I am getting is
Output: 2013
47.2 47.7
Name: 2022, dtype: object
I can't seem to figure why I am getting this output. Any help resolving this would be appreciated. Thanks!
AJ
Posts: 6,779
Threads: 20
Joined: Feb 2020
There is some weird multi-indexing thing going on with 2013.
If you look at the list of index values you'll see that 2013 is suspiciously absent even though it appears in the dataframe when printed.
Output: Index(['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
'Career', '7 yrs', '2 yrs', '1 yr'],
dtype='object')
Year Age Tm Pos No. G GS Pnt Yds Y/P RetYds Net NY/P Lng TB TB% In20 In20% Blck AV
2013 23 DET P 6 16 0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3
2014 24 DET P 6.0 16 0.0 68 3138 46.1 311 2727 40.1 71 5 7.4% 29 42.6% 0 3
2015 25 DET P 6.0 16 0.0 80 3679 46.0 263 3356 42.0 66 3 3.8% 25 31.3% 0 3
. . .
Then printed a row.
Output: df.loc['2017']
Year 2013
Age 23 27
Tm DET DET
Pos P P
This series has 2 rows. The year, and 2013. If I print the '2022' row I see the same thing. The 2022 year values and the 2013 values.
Output: df.loc['2022']
Year 2013
Age 23 32
Tm DET BUF
Pos P P
Why this is happening is beyond my scope. I looked at the web page, but it is too complicated for me to understand how the tables are created and why they are messing up the dataframe
idratherbecoding likes this post
Posts: 8
Threads: 3
Joined: Apr 2023
Thank you for taking a look and offering your input. I will keep working and see if I can figure it out. I will be sure to come back and let you know if I find a solution as I’m sure the eventual solution will likely be of general use as well. Thanks.
AJ
Posts: 6,779
Threads: 20
Joined: Feb 2020
Apr-18-2023, 03:50 PM
(This post was last modified: Apr-18-2023, 03:50 PM by deanhystad.)
Looked into it a bit more and the problem is due to not handling the multi-index columns correctly.
The columns for the table on the web page is multi-indexed/multi-leveled. You saw this indexed column and thought it messy, so you dropped it using skiprows=[0].
Without skiprows:
Output: Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games ... Punting Unnamed: 19_level_0
Year Age Tm Pos No. G ... TB% In20 In20% Blck AV
2013 23 DET P 6.0 16 ... 13.9% 22 30.6% 1 3
With skiprows:
Output: Year Age Tm Pos No. G GS Pnt Yds Y/P RetYds Net NY/P Lng TB TB% In20 In20% Blck AV
2013 23 DET P 6 16 0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3
This appears to fix the problem. bit it didn't. The columns are still multi-indexed. But now the second level columns are 2013, 23, DET, P... It looks like 2013 is the first row, but it is actually the second level of the column headers. This is why I could not print df.loc['2013'] and also why the information for 2013 is always printed. It is part of the column index.
There are probably multiple ways to fix this, such as keeping the multi-indexed columns and replacing the "Unnamed" labels with blanks. I didn't have any luck with that, but I could drop the top level of the columns.
import pandas as pd
df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm', index_col=0)[0]
df.columns = df.columns.droplevel(0)
print(df.loc['2022', 'Y/P']) Output: 47.7
idratherbecoding likes this post
Posts: 8
Threads: 3
Joined: Apr 2023
Thank you! That definitely did not occur to me that skipping the first row when constructing the DataFrame was the culprit. I will be using that method for consistency with the rest of the code in my larger project, which all uses pandas, but if we were not able to solve that, I found another solution that does not use pandas after digging into the HTML and seeing how the tables are constructed. Maybe it will come in handy for someone if they don’t want to use pandas for whatever reason.
In the body of the table we’re looking at, each row has an id in the form ‘punting.{year}’ and within each row there is an attribute for each column called ‘data-stat’ that identifies what statistic is displayed in each column. The one we are after is ‘punt_yds_per_punt’. So, this code gives the same output as your solution.
import requests
from bs4 import BeautifulSoup
i = requests.get('https://www.pro-football-reference.com/players/M/MartSa01.htm')
soup = BeautifulSoup(i.content, 'html.parser')
year = 2022
ypp = soup.find('tr', id=f'punting.{year}').find('td', attrs={'data-stat': 'punt_yds_per_punt'}).string
print(ypp) Output: 47.7
But the way you came up with is a lot cleaner, imo, so I will be using that method. Thank you again!
AJ
Posts: 7,312
Threads: 123
Joined: Sep 2016
Apr-18-2023, 07:20 PM
(This post was last modified: Apr-18-2023, 07:20 PM by snippsat.)
The Pandas solution is tricky because when read Table from web-site it create a MultiIndex of header row and first row(with year 2013).
Can solve it this way.
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
if __name__ == '__main__':
df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm')[0]
#print(df) >>> df.columns = df.iloc[0]
>>> df.columns = [f"col{str(i)}" for i in range(1, 21)]
>>> df
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 col16 col17 col18 col19 col20
0 2013 23 DET P 6.0 16 0.0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3
1 2014 24 DET P 6.0 16 0.0 68 3138 46.1 311 2727 40.1 71 5 7.4% 29 42.6% 0 3
2 2015 25 DET P 6.0 16 0.0 80 3679 46.0 263 3356 42.0 66 3 3.8% 25 31.3% 0 3
3 2016 26 DET P 6.0 16 0.0 62 3010 48.5 210 2740 44.2 63 3 4.8% 23 37.1% 0 3
4 2017 27 DET P 6.0 10 0.0 41 1779 43.4 116 1543 37.6 64 6 14.6% 13 31.7% 0 1
5 2018 28 DET P 6.0 16 0.0 74 3310 44.7 332 2898 39.2 61 4 5.4% 32 43.2% 0 2
6 2019 29 DET P 6.0 16 0.0 76 3445 45.3 130 3175 41.8 62 7 9.2% 31 40.8% 0 2
7 2020 30 DEN P 6.0 16 0.0 65 3040 46.8 226 2734 42.1 69 4 6.2% 19 29.2% 1 3
8 2021 31 DEN P 6.0 17 0.0 67 3083 46.0 154 2869 42.8 68 3 4.5% 28 41.8% 0 2
9 2022 32 BUF P 6.0 16 0.0 45 2146 47.7 151 1895 42.1 67 5 11.1% 16 35.6% 1 2
10 Career Career NaN NaN NaN 155 NaN 650 30029 46.2 2143 26886 41.4 72 50 7.7% 238 36.6% 3 24
11 7 yrs 7 yrs DET NaN NaN 106 0.0 473 21760 46.0 1612 19388 41.0 72 38 8.0% 175 37.0% 1 17
12 2 yrs 2 yrs DEN NaN NaN 33 0.0 132 6123 46.4 380 5603 42.4 69 7 5.3% 47 35.6% 1 5
13 1 yr 1 yr BUF NaN NaN 16 0.0 45 2146 47.7 151 1895 42.1 67 5 11.1% 16 35.6% 1 2 So now have fixed the MultiIndex problem that was at start.
I have just auto renamed all columns,if do the job like so df.columns = ['Year', 'Age', 'Tm' ....]
Then will have a working DataFrame,and then start bye looking if types are correct.
>>> df.dtypes
col1 object
col2 object
col3 object
col4 object
col5 float64
col6 int64
col7 float64
col8 int64
col9 int64
col10 float64
col11 int64
col12 int64
col13 float64
col14 int64
col15 int64
col16 object
col17 int64
col18 object
col19 int64
col20 int64
dtype: object Like col1(Year) need to be pandas datatime64 if you shall use years for something like eg a Plot.
Also the last 4 rows could be cut/delete,if need Career make to new DataFrame
idratherbecoding likes this post
Posts: 6,779
Threads: 20
Joined: Feb 2020
Apr-19-2023, 12:11 AM
(This post was last modified: Apr-19-2023, 12:11 AM by deanhystad.)
Figured out how to scrub "Unnamed" from the columns.
import pandas as pd
def scrub_unnamed(columns):
"""Replaces "Unamed: *" with blank in columns."""
return pd.MultiIndex.from_tuples(
[tuple("" if x.startswith("Unnamed:") else x for x in col) for col in columns]
)
df = pd.read_html(
"https://www.pro-football-reference.com/players/M/MartSa01.htm", index_col=0
)[0]
df.columns = scrub_unnamed(df.columns)
print(df) Output: Games Punting
Age Tm Pos No. G GS Pnt Yds Y/P RetYds Net NY/P Lng TB TB% In20 In20% Blck AV
2013 23 DET P 6.0 16 0.0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3
2014 24 DET P 6.0 16 0.0 68 3138 46.1 311 2727 40.1 71 5 7.4% 29 42.6% 0 3
2015 25 DET P 6.0 16 0.0 80 3679 46.0 263 3356 42.0 66 3 3.8% 25 31.3% 0 3
2016 26 DET P 6.0 16 0.0 62 3010 48.5 210 2740 44.2 63 3 4.8% 23 37.1% 0 3
2017 27 DET P 6.0 10 0.0 41 1779 43.4 116 1543 37.6 64 6 14.6% 13 31.7% 0 1
2018 28 DET P 6.0 16 0.0 74 3310 44.7 332 2898 39.2 61 4 5.4% 32 43.2% 0 2
2019 29 DET P 6.0 16 0.0 76 3445 45.3 130 3175 41.8 62 7 9.2% 31 40.8% 0 2
2020 30 DEN P 6.0 16 0.0 65 3040 46.8 226 2734 42.1 69 4 6.2% 19 29.2% 1 3
2021 31 DEN P 6.0 17 0.0 67 3083 46.0 154 2869 42.8 68 3 4.5% 28 41.8% 0 2
2022 32 BUF P 6.0 16 0.0 45 2146 47.7 151 1895 42.1 67 5 11.1% 16 35.6% 1 2
Career Career NaN NaN NaN 155 NaN 650 30029 46.2 2143 26886 41.4 72 50 7.7% 238 36.6% 3 24
7 yrs 7 yrs DET NaN NaN 106 0.0 473 21760 46.0 1612 19388 41.0 72 38 8.0% 175 37.0% 1 17
2 yrs 2 yrs DEN NaN NaN 33 0.0 132 6123 46.4 380 5603 42.4 69 7 5.3% 47 35.6% 1 5
1 yr 1 yr BUF NaN NaN 16 0.0 45 2146 47.7 151 1895 42.1 67 5 11.1% 16 35.6% 1 2
idratherbecoding likes this post
|