Unexpected output from df.loc when indexing by label

idratherbecoding · Apr-18-2023, 01:57 AM

I am back again with another pandas question. I am getting some unexpected output from df.loc. As part of a larger project, I broke my problem out into a smaller project to exclude any code that is extraneous to the issue I am having. Please see the below code:

import pandas as pd

if __name__ == '__main__':
    df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm', skiprows=[0], index_col=0)[0]

    print(df.index) #Verifying index has the values I am expecting

    print(df.loc['2022', 'Y/P']) #Intended behavior is to get row '2022' and column 'Y/P'

    print(df) #Print whole table to compare to output above

If you run this code, you'll see that the DataFrame is wide enough that pandas cuts out the relevant column, so if you are interested, you can go look at the url from the code to see the table. Aside from printing the index column and the whole DataFrame, which both output as expected, the output I am expecting is 47.7, the value in the referenced cell. The output I am getting is

Output:2013
47.2    47.7
Name: 2022, dtype: object

I can't seem to figure why I am getting this output. Any help resolving this would be appreciated. Thanks!

AJ

**deanhystad** · Apr-18-2023, 01:07 PM

There is some weird multi-indexing thing going on with 2013.

If you look at the list of index values you'll see that 2013 is suspiciously absent even though it appears in the dataframe when printed.

Output:Index(['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       'Career', '7 yrs', '2 yrs', '1 yr'],
      dtype='object')
Year       Age   Tm  Pos  No.    G   GS  Pnt    Yds   Y/P RetYds    Net  NY/P Lng  TB    TB% In20  In20% Blck  AV
2013        23  DET    P    6   16    0   72   3399  47.2    250   2949  41.0  72  10  13.9%   22  30.6%    1   3
2014        24  DET    P  6.0   16  0.0   68   3138  46.1    311   2727  40.1  71   5   7.4%   29  42.6%    0   3
2015        25  DET    P  6.0   16  0.0   80   3679  46.0    263   3356  42.0  66   3   3.8%   25  31.3%    0   3
. . .

Then printed a row.

Output:df.loc['2017']
Year    2013 
Age     23          27
Tm      DET        DET
Pos     P            P

This series has 2 rows. The year, and 2013. If I print the '2022' row I see the same thing. The 2022 year values and the 2013 values.

Output:df.loc['2022']
Year    2013 
Age     23          32
Tm      DET        BUF
Pos     P            P

Why this is happening is beyond my scope. I looked at the web page, but it is too complicated for me to understand how the tables are created and why they are messing up the dataframe

idratherbecoding · Apr-18-2023, 01:18 PM

Thank you for taking a look and offering your input. I will keep working and see if I can figure it out. I will be sure to come back and let you know if I find a solution as I’m sure the eventual solution will likely be of general use as well. Thanks.

AJ

**deanhystad** · (This post was last modified: Apr-18-2023, 03:50 PM by deanhystad.)

Looked into it a bit more and the problem is due to not handling the multi-index columns correctly.

The columns for the table on the web page is multi-indexed/multi-leveled. You saw this indexed column and thought it messy, so you dropped it using skiprows=[0].

Without skiprows:

Output:       Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games  ... Punting                  Unnamed: 19_level_0
Year                  Age                 Tm                Pos                No.     G  ...     TB% In20  In20% Blck                  AV  
2013                   23                DET                  P                6.0    16  ...   13.9%   22  30.6%    1                   3

With skiprows:

Output:Year       Age   Tm  Pos  No.    G   GS  Pnt    Yds   Y/P RetYds    Net  NY/P Lng  TB    TB% In20  In20% Blck  AV
2013        23  DET    P    6   16    0   72   3399  47.2    250   2949  41.0  72  10  13.9%   22  30.6%    1   3

This appears to fix the problem. bit it didn't. The columns are still multi-indexed. But now the second level columns are 2013, 23, DET, P... It looks like 2013 is the first row, but it is actually the second level of the column headers. This is why I could not print df.loc['2013'] and also why the information for 2013 is always printed. It is part of the column index.

There are probably multiple ways to fix this, such as keeping the multi-indexed columns and replacing the "Unnamed" labels with blanks. I didn't have any luck with that, but I could drop the top level of the columns.

import pandas as pd

df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm', index_col=0)[0]
df.columns = df.columns.droplevel(0)
print(df.loc['2022', 'Y/P'])

Output:
47.7

idratherbecoding · Apr-18-2023, 04:06 PM

Thank you! That definitely did not occur to me that skipping the first row when constructing the DataFrame was the culprit. I will be using that method for consistency with the rest of the code in my larger project, which all uses pandas, but if we were not able to solve that, I found another solution that does not use pandas after digging into the HTML and seeing how the tables are constructed. Maybe it will come in handy for someone if they don’t want to use pandas for whatever reason.

In the body of the table we’re looking at, each row has an id in the form ‘punting.{year}’ and within each row there is an attribute for each column called ‘data-stat’ that identifies what statistic is displayed in each column. The one we are after is ‘punt_yds_per_punt’. So, this code gives the same output as your solution.

import requests
from bs4 import BeautifulSoup

i = requests.get('https://www.pro-football-reference.com/players/M/MartSa01.htm')

soup = BeautifulSoup(i.content, 'html.parser')

year = 2022

ypp = soup.find('tr', id=f'punting.{year}').find('td', attrs={'data-stat': 'punt_yds_per_punt'}).string

print(ypp)

Output:
47.7

But the way you came up with is a lot cleaner, imo, so I will be using that method. Thank you again!

AJ

***snippsat*** · (This post was last modified: Apr-18-2023, 07:20 PM by snippsat.)

The Pandas solution is tricky because when read Table from web-site it create a MultiIndex of header row and first row(with year 2013).
Can solve it this way.

import pandas as pd
pd.set_option('display.expand_frame_repr', False)

if __name__ == '__main__':
    df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm')[0]
    #print(df)

>>> df.columns = df.iloc[0]
>>> df.columns = [f"col{str(i)}" for i in range(1, 21)]
>>> df
      col1    col2 col3 col4  col5  col6  col7  col8   col9  col10  col11  col12  col13  col14  col15  col16  col17  col18  col19  col20
0     2013      23  DET    P   6.0    16   0.0    72   3399   47.2    250   2949   41.0     72     10  13.9%     22  30.6%      1      3
1     2014      24  DET    P   6.0    16   0.0    68   3138   46.1    311   2727   40.1     71      5   7.4%     29  42.6%      0      3
2     2015      25  DET    P   6.0    16   0.0    80   3679   46.0    263   3356   42.0     66      3   3.8%     25  31.3%      0      3
3     2016      26  DET    P   6.0    16   0.0    62   3010   48.5    210   2740   44.2     63      3   4.8%     23  37.1%      0      3
4     2017      27  DET    P   6.0    10   0.0    41   1779   43.4    116   1543   37.6     64      6  14.6%     13  31.7%      0      1
5     2018      28  DET    P   6.0    16   0.0    74   3310   44.7    332   2898   39.2     61      4   5.4%     32  43.2%      0      2
6     2019      29  DET    P   6.0    16   0.0    76   3445   45.3    130   3175   41.8     62      7   9.2%     31  40.8%      0      2
7     2020      30  DEN    P   6.0    16   0.0    65   3040   46.8    226   2734   42.1     69      4   6.2%     19  29.2%      1      3
8     2021      31  DEN    P   6.0    17   0.0    67   3083   46.0    154   2869   42.8     68      3   4.5%     28  41.8%      0      2
9     2022      32  BUF    P   6.0    16   0.0    45   2146   47.7    151   1895   42.1     67      5  11.1%     16  35.6%      1      2
10  Career  Career  NaN  NaN   NaN   155   NaN   650  30029   46.2   2143  26886   41.4     72     50   7.7%    238  36.6%      3     24
11   7 yrs   7 yrs  DET  NaN   NaN   106   0.0   473  21760   46.0   1612  19388   41.0     72     38   8.0%    175  37.0%      1     17
12   2 yrs   2 yrs  DEN  NaN   NaN    33   0.0   132   6123   46.4    380   5603   42.4     69      7   5.3%     47  35.6%      1      5
13    1 yr    1 yr  BUF  NaN   NaN    16   0.0    45   2146   47.7    151   1895   42.1     67      5  11.1%     16  35.6%      1      2

So now have fixed the MultiIndex problem that was at start.
I have just auto renamed all columns,if do the job like so df.columns = ['Year', 'Age', 'Tm' ....]
Then will have a working DataFrame,and then start bye looking if types are correct.

>>> df.dtypes
col1      object
col2      object
col3      object
col4      object
col5     float64
col6       int64
col7     float64
col8       int64
col9       int64
col10    float64
col11      int64
col12      int64
col13    float64
col14      int64
col15      int64
col16     object
col17      int64
col18     object
col19      int64
col20      int64
dtype: object

Like col1(Year) need to be pandas datatime64 if you shall use years for something like eg a Plot.
Also the last 4 rows could be cut/delete,if need Career make to new DataFrame

**deanhystad** · (This post was last modified: Apr-19-2023, 12:11 AM by deanhystad.)

Figured out how to scrub "Unnamed" from the columns.

import pandas as pd


def scrub_unnamed(columns):
    """Replaces "Unamed: *" with blank in columns."""
    return pd.MultiIndex.from_tuples(
        [tuple("" if x.startswith("Unnamed:") else x for x in col) for col in columns]
    )


df = pd.read_html(
    "https://www.pro-football-reference.com/players/M/MartSa01.htm", index_col=0
)[0]
df.columns = scrub_unnamed(df.columns)
print(df)

Output:                              Games      Punting
           Age   Tm  Pos  No.     G   GS     Pnt    Yds   Y/P RetYds    Net  NY/P Lng  TB    TB% In20  In20% Blck  AV
2013        23  DET    P  6.0    16  0.0      72   3399  47.2    250   2949  41.0  72  10  13.9%   22  30.6%    1   3
2014        24  DET    P  6.0    16  0.0      68   3138  46.1    311   2727  40.1  71   5   7.4%   29  42.6%    0   3
2015        25  DET    P  6.0    16  0.0      80   3679  46.0    263   3356  42.0  66   3   3.8%   25  31.3%    0   3
2016        26  DET    P  6.0    16  0.0      62   3010  48.5    210   2740  44.2  63   3   4.8%   23  37.1%    0   3
2017        27  DET    P  6.0    10  0.0      41   1779  43.4    116   1543  37.6  64   6  14.6%   13  31.7%    0   1
2018        28  DET    P  6.0    16  0.0      74   3310  44.7    332   2898  39.2  61   4   5.4%   32  43.2%    0   2
2019        29  DET    P  6.0    16  0.0      76   3445  45.3    130   3175  41.8  62   7   9.2%   31  40.8%    0   2
2020        30  DEN    P  6.0    16  0.0      65   3040  46.8    226   2734  42.1  69   4   6.2%   19  29.2%    1   3
2021        31  DEN    P  6.0    17  0.0      67   3083  46.0    154   2869  42.8  68   3   4.5%   28  41.8%    0   2
2022        32  BUF    P  6.0    16  0.0      45   2146  47.7    151   1895  42.1  67   5  11.1%   16  35.6%    1   2
Career  Career  NaN  NaN  NaN   155  NaN     650  30029  46.2   2143  26886  41.4  72  50   7.7%  238  36.6%    3  24
7 yrs    7 yrs  DET  NaN  NaN   106  0.0     473  21760  46.0   1612  19388  41.0  72  38   8.0%  175  37.0%    1  17
2 yrs    2 yrs  DEN  NaN  NaN    33  0.0     132   6123  46.4    380   5603  42.4  69   7   5.3%   47  35.6%    1   5
1 yr      1 yr  BUF  NaN  NaN    16  0.0      45   2146  47.7    151   1895  42.1  67   5  11.1%   16  35.6%    1   2

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Unexpected output	Starter	2	495	Nov-22-2023, 12:08 AM Last Post: Starter
	Unexpected Output - Python Dataframes: Filtering based on Overlapping Dates	Xensor	5	722	Nov-15-2023, 06:54 PM Last Post: deanhystad
	Unexpected output while using random.randint with def	terickson2367	1	520	Oct-24-2023, 05:56 AM Last Post: buran
	unexpected output	asyrafcc99	0	1,510	Oct-24-2020, 02:40 PM Last Post: asyrafcc99
	Unexpected output: symbols for derivative not being displayed	saucerdesigner	0	2,061	Jun-22-2020, 10:06 PM Last Post: saucerdesigner
	How to change 0 based indexing to 1 based indexing in python..??	Ruthra	2	4,351	Jan-22-2020, 05:13 PM Last Post: Ruthra
	Unexpected output	palladium	4	2,761	Jan-11-2020, 03:26 PM Last Post: palladium
	Unexpected output: if statement	CabbageMan	1	1,771	Sep-04-2019, 04:12 PM Last Post: ThomasL
	Unexpected Output using classes and inheritance	langley	2	1,958	Jul-04-2019, 09:33 AM Last Post: langley
	float multiplication - unexpected output	inesk	3	3,359	Dec-11-2018, 10:59 AM Last Post: DeaD_EyE

Unexpected output from df.loc when indexing by label

User Panel Messages

Announcements