Python Forum
Unexpected output from df.loc when indexing by label
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Unexpected output from df.loc when indexing by label
#1
I am back again with another pandas question. I am getting some unexpected output from df.loc. As part of a larger project, I broke my problem out into a smaller project to exclude any code that is extraneous to the issue I am having. Please see the below code:

import pandas as pd

if __name__ == '__main__':
    df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm', skiprows=[0], index_col=0)[0]

    print(df.index) #Verifying index has the values I am expecting

    print(df.loc['2022', 'Y/P']) #Intended behavior is to get row '2022' and column 'Y/P'

    print(df) #Print whole table to compare to output above
If you run this code, you'll see that the DataFrame is wide enough that pandas cuts out the relevant column, so if you are interested, you can go look at the url from the code to see the table. Aside from printing the index column and the whole DataFrame, which both output as expected, the output I am expecting is 47.7, the value in the referenced cell. The output I am getting is

Output:
2013 47.2 47.7 Name: 2022, dtype: object
I can't seem to figure why I am getting this output. Any help resolving this would be appreciated. Thanks!

AJ
Reply
#2
There is some weird multi-indexing thing going on with 2013.

If you look at the list of index values you'll see that 2013 is suspiciously absent even though it appears in the dataframe when printed.
Output:
Index(['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', 'Career', '7 yrs', '2 yrs', '1 yr'], dtype='object') Year Age Tm Pos No. G GS Pnt Yds Y/P RetYds Net NY/P Lng TB TB% In20 In20% Blck AV 2013 23 DET P 6 16 0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3 2014 24 DET P 6.0 16 0.0 68 3138 46.1 311 2727 40.1 71 5 7.4% 29 42.6% 0 3 2015 25 DET P 6.0 16 0.0 80 3679 46.0 263 3356 42.0 66 3 3.8% 25 31.3% 0 3 . . .
Then printed a row.
Output:
df.loc['2017'] Year 2013 Age 23 27 Tm DET DET Pos P P
This series has 2 rows. The year, and 2013. If I print the '2022' row I see the same thing. The 2022 year values and the 2013 values.
Output:
df.loc['2022'] Year 2013 Age 23 32 Tm DET BUF Pos P P
Why this is happening is beyond my scope. I looked at the web page, but it is too complicated for me to understand how the tables are created and why they are messing up the dataframe
idratherbecoding likes this post
Reply
#3
Thank you for taking a look and offering your input. I will keep working and see if I can figure it out. I will be sure to come back and let you know if I find a solution as I’m sure the eventual solution will likely be of general use as well. Thanks.

AJ
Reply
#4
Looked into it a bit more and the problem is due to not handling the multi-index columns correctly.

The columns for the table on the web page is multi-indexed/multi-leveled. You saw this indexed column and thought it messy, so you dropped it using skiprows=[0].

Without skiprows:
Output:
Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games ... Punting Unnamed: 19_level_0 Year Age Tm Pos No. G ... TB% In20 In20% Blck AV 2013 23 DET P 6.0 16 ... 13.9% 22 30.6% 1 3
With skiprows:
Output:
Year Age Tm Pos No. G GS Pnt Yds Y/P RetYds Net NY/P Lng TB TB% In20 In20% Blck AV 2013 23 DET P 6 16 0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3
This appears to fix the problem. bit it didn't. The columns are still multi-indexed. But now the second level columns are 2013, 23, DET, P... It looks like 2013 is the first row, but it is actually the second level of the column headers. This is why I could not print df.loc['2013'] and also why the information for 2013 is always printed. It is part of the column index.

There are probably multiple ways to fix this, such as keeping the multi-indexed columns and replacing the "Unnamed" labels with blanks. I didn't have any luck with that, but I could drop the top level of the columns.
import pandas as pd

df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm', index_col=0)[0]
df.columns = df.columns.droplevel(0)
print(df.loc['2022', 'Y/P'])
Output:
47.7
idratherbecoding likes this post
Reply
#5
Thank you! That definitely did not occur to me that skipping the first row when constructing the DataFrame was the culprit. I will be using that method for consistency with the rest of the code in my larger project, which all uses pandas, but if we were not able to solve that, I found another solution that does not use pandas after digging into the HTML and seeing how the tables are constructed. Maybe it will come in handy for someone if they don’t want to use pandas for whatever reason.

In the body of the table we’re looking at, each row has an id in the form ‘punting.{year}’ and within each row there is an attribute for each column called ‘data-stat’ that identifies what statistic is displayed in each column. The one we are after is ‘punt_yds_per_punt’. So, this code gives the same output as your solution.

import requests
from bs4 import BeautifulSoup

i = requests.get('https://www.pro-football-reference.com/players/M/MartSa01.htm')

soup = BeautifulSoup(i.content, 'html.parser')

year = 2022

ypp = soup.find('tr', id=f'punting.{year}').find('td', attrs={'data-stat': 'punt_yds_per_punt'}).string

print(ypp)
Output:
47.7
But the way you came up with is a lot cleaner, imo, so I will be using that method. Thank you again!

AJ
Reply
#6
The Pandas solution is tricky because when read Table from web-site it create a MultiIndex of header row and first row(with year 2013).
Can solve it this way.
import pandas as pd
pd.set_option('display.expand_frame_repr', False)

if __name__ == '__main__':
    df = pd.read_html('https://www.pro-football-reference.com/players/M/MartSa01.htm')[0]
    #print(df)
>>> df.columns = df.iloc[0]
>>> df.columns = [f"col{str(i)}" for i in range(1, 21)]
>>> df
      col1    col2 col3 col4  col5  col6  col7  col8   col9  col10  col11  col12  col13  col14  col15  col16  col17  col18  col19  col20
0     2013      23  DET    P   6.0    16   0.0    72   3399   47.2    250   2949   41.0     72     10  13.9%     22  30.6%      1      3
1     2014      24  DET    P   6.0    16   0.0    68   3138   46.1    311   2727   40.1     71      5   7.4%     29  42.6%      0      3
2     2015      25  DET    P   6.0    16   0.0    80   3679   46.0    263   3356   42.0     66      3   3.8%     25  31.3%      0      3
3     2016      26  DET    P   6.0    16   0.0    62   3010   48.5    210   2740   44.2     63      3   4.8%     23  37.1%      0      3
4     2017      27  DET    P   6.0    10   0.0    41   1779   43.4    116   1543   37.6     64      6  14.6%     13  31.7%      0      1
5     2018      28  DET    P   6.0    16   0.0    74   3310   44.7    332   2898   39.2     61      4   5.4%     32  43.2%      0      2
6     2019      29  DET    P   6.0    16   0.0    76   3445   45.3    130   3175   41.8     62      7   9.2%     31  40.8%      0      2
7     2020      30  DEN    P   6.0    16   0.0    65   3040   46.8    226   2734   42.1     69      4   6.2%     19  29.2%      1      3
8     2021      31  DEN    P   6.0    17   0.0    67   3083   46.0    154   2869   42.8     68      3   4.5%     28  41.8%      0      2
9     2022      32  BUF    P   6.0    16   0.0    45   2146   47.7    151   1895   42.1     67      5  11.1%     16  35.6%      1      2
10  Career  Career  NaN  NaN   NaN   155   NaN   650  30029   46.2   2143  26886   41.4     72     50   7.7%    238  36.6%      3     24
11   7 yrs   7 yrs  DET  NaN   NaN   106   0.0   473  21760   46.0   1612  19388   41.0     72     38   8.0%    175  37.0%      1     17
12   2 yrs   2 yrs  DEN  NaN   NaN    33   0.0   132   6123   46.4    380   5603   42.4     69      7   5.3%     47  35.6%      1      5
13    1 yr    1 yr  BUF  NaN   NaN    16   0.0    45   2146   47.7    151   1895   42.1     67      5  11.1%     16  35.6%      1      2
So now have fixed the MultiIndex problem that was at start.
I have just auto renamed all columns,if do the job like so df.columns = ['Year', 'Age', 'Tm' ....]
Then will have a working DataFrame,and then start bye looking if types are correct.
>>> df.dtypes
col1      object
col2      object
col3      object
col4      object
col5     float64
col6       int64
col7     float64
col8       int64
col9       int64
col10    float64
col11      int64
col12      int64
col13    float64
col14      int64
col15      int64
col16     object
col17      int64
col18     object
col19      int64
col20      int64
dtype: object 
Like col1(Year) need to be pandas datatime64 if you shall use years for something like eg a Plot.
Also the last 4 rows could be cut/delete,if need Career make to new DataFrame
idratherbecoding likes this post
Reply
#7
Figured out how to scrub "Unnamed" from the columns.
import pandas as pd


def scrub_unnamed(columns):
    """Replaces "Unamed: *" with blank in columns."""
    return pd.MultiIndex.from_tuples(
        [tuple("" if x.startswith("Unnamed:") else x for x in col) for col in columns]
    )


df = pd.read_html(
    "https://www.pro-football-reference.com/players/M/MartSa01.htm", index_col=0
)[0]
df.columns = scrub_unnamed(df.columns)
print(df)
Output:
Games Punting Age Tm Pos No. G GS Pnt Yds Y/P RetYds Net NY/P Lng TB TB% In20 In20% Blck AV 2013 23 DET P 6.0 16 0.0 72 3399 47.2 250 2949 41.0 72 10 13.9% 22 30.6% 1 3 2014 24 DET P 6.0 16 0.0 68 3138 46.1 311 2727 40.1 71 5 7.4% 29 42.6% 0 3 2015 25 DET P 6.0 16 0.0 80 3679 46.0 263 3356 42.0 66 3 3.8% 25 31.3% 0 3 2016 26 DET P 6.0 16 0.0 62 3010 48.5 210 2740 44.2 63 3 4.8% 23 37.1% 0 3 2017 27 DET P 6.0 10 0.0 41 1779 43.4 116 1543 37.6 64 6 14.6% 13 31.7% 0 1 2018 28 DET P 6.0 16 0.0 74 3310 44.7 332 2898 39.2 61 4 5.4% 32 43.2% 0 2 2019 29 DET P 6.0 16 0.0 76 3445 45.3 130 3175 41.8 62 7 9.2% 31 40.8% 0 2 2020 30 DEN P 6.0 16 0.0 65 3040 46.8 226 2734 42.1 69 4 6.2% 19 29.2% 1 3 2021 31 DEN P 6.0 17 0.0 67 3083 46.0 154 2869 42.8 68 3 4.5% 28 41.8% 0 2 2022 32 BUF P 6.0 16 0.0 45 2146 47.7 151 1895 42.1 67 5 11.1% 16 35.6% 1 2 Career Career NaN NaN NaN 155 NaN 650 30029 46.2 2143 26886 41.4 72 50 7.7% 238 36.6% 3 24 7 yrs 7 yrs DET NaN NaN 106 0.0 473 21760 46.0 1612 19388 41.0 72 38 8.0% 175 37.0% 1 17 2 yrs 2 yrs DEN NaN NaN 33 0.0 132 6123 46.4 380 5603 42.4 69 7 5.3% 47 35.6% 1 5 1 yr 1 yr BUF NaN NaN 16 0.0 45 2146 47.7 151 1895 42.1 67 5 11.1% 16 35.6% 1 2
idratherbecoding likes this post
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Unexpected output Starter 2 495 Nov-22-2023, 12:08 AM
Last Post: Starter
  Unexpected Output - Python Dataframes: Filtering based on Overlapping Dates Xensor 5 722 Nov-15-2023, 06:54 PM
Last Post: deanhystad
  Unexpected output while using random.randint with def terickson2367 1 520 Oct-24-2023, 05:56 AM
Last Post: buran
  unexpected output asyrafcc99 0 1,510 Oct-24-2020, 02:40 PM
Last Post: asyrafcc99
  Unexpected output: symbols for derivative not being displayed saucerdesigner 0 2,061 Jun-22-2020, 10:06 PM
Last Post: saucerdesigner
  How to change 0 based indexing to 1 based indexing in python..?? Ruthra 2 4,351 Jan-22-2020, 05:13 PM
Last Post: Ruthra
  Unexpected output palladium 4 2,761 Jan-11-2020, 03:26 PM
Last Post: palladium
  Unexpected output: if statement CabbageMan 1 1,771 Sep-04-2019, 04:12 PM
Last Post: ThomasL
  Unexpected Output using classes and inheritance langley 2 1,958 Jul-04-2019, 09:33 AM
Last Post: langley
  float multiplication - unexpected output inesk 3 3,359 Dec-11-2018, 10:59 AM
Last Post: DeaD_EyE

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020