Python Forum

Full Version: for loop in dataframe in pandas
Hello,

I have a problem with a "for loop" over a pandas dataframe and hope somebody can help with it.

I have the following dataframe in a csv file:

,forename,surname,gender,age,100m,200m,400m,800m,1500m
0,Migdalia,Parrish,F,18,11.08,29.0,59.41,122.05,259.11
1,Valerie,Lee,F,10,17.23,46.0,100.02,232.64,480.95
2,John,Debnam,M,17,10.81,25.89,50.6,110.29,232.39
3,Roy,Miller,M,10,19.18,46.74,95.32,201.14,430.27
4,Aida,Aumiller,F,11,15.3,41.83,81.06,189.03,394.9
5,Marcia,Brown,F,19,11.13,24.62,57.59,119.13,256.37
6,Harry,Knows,M,16,12.39,25.94,49.67,106.56,237.14
7,Barry,Lennon,M,14,11.15,23.56,46.46,110.89,230.49
8,Lilia,Armstrong,F,13,8.84,25.09,59.54,128.95,258.47
9,Johnny,Casey,M,15,9.65,22.67,49.46,112.85,233.87
10,Donald,Taylor,M,15,11.74,22.42,49.22,114.62,224.63
11,Martha,Woods,F,14,9.01,24.34,55.25,118.8,254.87
12,Diane,Lauria,F,15,8.99,27.92,54.79,119.89,249.21
13,Yvonne,Pumphrey,F,16,8.84,27.29,57.63,123.13,247.41
14,Betty,Stephenson,F,14,11.04,28.73,59.05,126.29,256.44
15,Lilia,Armstrong,F,12,11.31,34.43,74.28,150.05,321.07

And I have to create a main function that calls another function that, using a "for loop", retrieves the fastest time for each age (10,11,12,13,14,15,16) for a specific gender (e.g. 'F') and distance (e.g. '100m').

For example:
Input:
fastest_athletes = find_fastest_athletes(df,"100m","F",[10,11,12,13,14,15,16])
Output:
{
10: {'forename': 'Valerie', 'surname': 'Lee', 'time': '17.23'},
11: {'forename': 'Aida', 'surname': 'Aumiller', 'time': '15.3'},
12: {'forename': 'Lilia', 'surname': 'Armstrong', 'time': '11.31'},
13: {'forename': 'Lilia', 'surname': 'Armstrong', 'time': '8.84'},
14: {'forename': 'Martha', 'surname': 'Woods', 'time': '9.01'},
15: {'forename': 'Diane', 'surname': 'Lauria', 'time': '8.99'},
16: {'forename': 'Yvonne', 'surname': 'Pumphrey', 'time': '8.84'}
}

I did the following code:

import pandas as pd

# Function with the for loop
def find_fastest_athletes(df,distance,gender,ages):
  for age in range(10,16):
    fastest_athletes = df[(df["gender"] == gender) & (df["age"] == age)]
    fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
    fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
    athletes_data = fastest_athletes_value.to_string(index=False, header=False).split('  ')
    athletes_data_dict = {
        'forename': athletes_data[0].strip(),
        'surname': athletes_data[1],
        'time': float(athletes_data[2])
        }
  return athletes_data_dict
  
# Main function
def main(filename='athletes.csv'):
    df = pd.read_csv(filename, index_col=0)
    df['100m'] = df['100m'].astype(float)
    print(find_fastest_athletes(df,'100m','F',[10,11,12,13,14,15,16]))
    return
   
if __name__ == "__main__":
  main()  
With my code, the output contains ONLY the fastest athlete for the last age (16 years old) and not the fastest athlete for each age (10,11,12,13,14,15,16). Why is that?

Also how can I add the age at the beginning of each line?
It seems that in your "athletes_data_dict" you are using fixed key names (forename,surname,time),
over and over again. Keys should be unique in a dictionary.

Paul
(Dec-01-2021, 04:21 PM) DPaul wrote: It seems that in your "athletes_data_dict" you are using fixed key names (forename,surname,time),
over and over again. Keys should be unique in a dictionary.

Paul

If I take out the "for age in range(10,16)" and run the code for only one age (e.g. 16), it works perfectly.
The problem is that when I want the fastest athlete for each age, I get only the last one from the loop.

I think it may be related to how I wrote the for loop, but I tried different ways and I still get only one output instead of 7.
Each time through the loop you create (and overwrite if already created) athletes_data_dict. You're not storing the one for each age anywhere. You need a collection like a list and append the dict for each age to it.

Then return the collection instead of athletes_data_dict which has only the last age in it.
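
To illustrate the point about overwriting, here is a minimal sketch in plain Python. The names `latest` and `collected` and the toy `ages` list are made up for the example; they are not from the code in the question.

```python
# Hypothetical stand-in data; names are illustrative only.
ages = [10, 11, 12]

# Rebinding the same name on every pass: only the last value is left.
for age in ages:
    latest = {"age": age}

# Appending to a list created before the loop keeps every value.
collected = []
for age in ages:
    collected.append({"age": age})

print(latest)     # {'age': 12}
print(collected)  # [{'age': 10}, {'age': 11}, {'age': 12}]
```

The same thing happens with athletes_data_dict: each pass replaces the previous dict, so the return after the loop only sees the final one.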
(Dec-01-2021, 05:28 PM) bowlofred wrote: Each time through the loop you create (and overwrite if already created) athletes_data_dict. You're not storing the one for each age anywhere. You need a collection like a list and append the dict for each age to it.

Then return the collection instead of athletes_data_dict which has only the last age in it.

Thanks for that, I'm starting to understand how it works now. Inside the function's loop I added an empty list called collection, and I append the output for each age to it, as shown below:

def find_fastest_athletes(df,distance,gender,age):
  for age in range(11,17,1):
    fastest_athletes = df[(df["gender"] == gender) & (df["age"] == age)]
    fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
    fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname","100m"]]
    athletes_data = fastest_athletes_value.to_string(index=False, header=False).split('  ')
    athletes_data_dict = {
        'forename': athletes_data[0].strip(),
        'surname': athletes_data[1],
        'time': float(athletes_data[2])
        }
    collection=[]
    collection.append(athletes_data_dict)
  return collection


But now I don't understand why I'm still getting only the last fastest athlete (16 years old). Shouldn't each pass through the loop add the new athlete to the collection list?
You're creating a (new, empty) collection inside the loop. So each time through you throw away the old one.

Create the collection outside the loop.
Append to it inside the loop.
Return the collection after the loop.
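
The three steps above could be sketched like this. This is only one possible version, under some assumptions: the DataFrame is a small subset of the rows from the question, `idxmin` is swapped in for the sort/to_string/split approach, and skipping ages with no matching athlete is my own choice rather than something the original code does.

```python
import pandas as pd

# Toy data: a subset of the rows from the question's CSV.
df = pd.DataFrame({
    "forename": ["Valerie", "Aida", "Lilia", "Martha", "Betty"],
    "surname": ["Lee", "Aumiller", "Armstrong", "Woods", "Stephenson"],
    "gender": ["F", "F", "F", "F", "F"],
    "age": [10, 11, 13, 14, 14],
    "100m": [17.23, 15.3, 8.84, 9.01, 11.04],
})

def find_fastest_athletes(df, distance, gender, ages):
    results = []                                   # 1. create before the loop
    for age in ages:
        group = df[(df["gender"] == gender) & (df["age"] == age)]
        if group.empty:                            # assumption: skip ages with no rows
            continue
        row = group.loc[group[distance].idxmin()]  # row with the smallest time
        results.append((age, {                     # 2. append inside the loop
            "forename": row["forename"],
            "surname": row["surname"],
            "time": float(row[distance]),
        }))
    return results                                 # 3. return after the loop

fastest = find_fastest_athletes(df, "100m", "F", [10, 11, 12, 13, 14])
print(fastest)
```

In this toy data there is no 12-year-old, so that age is simply left out of the result.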
(Dec-01-2021, 07:50 PM) bowlofred wrote: You're creating a (new, empty) collection inside the loop. So each time through you throw away the old one.

Create the collection outside the loop.
Append to it inside the loop.
Return the collection after the loop.

Many thanks for that. I think I now have the full code working, please see below:

def find_fastest_athletes(df,distance,gender,ages):
  data=[]
  for age in ages:
    fastest_athletes = df.loc[(df.gender == gender) & (df.age == age)]
    fastest_athletes_sorted = fastest_athletes.sort_values(distance,ascending=True)
    # Use the distance argument instead of a hard-coded "100m" column
    fastest_athletes_value = fastest_athletes_sorted.iloc[[0]][["forename","surname",distance]]
    athletes_data = fastest_athletes_value.to_string(index=False, header=False).split('  ')
    athletes_data_dict ={
        'forename': athletes_data[0].strip(),
        'surname': athletes_data[1],
        'time': float(athletes_data[2])
    }
    athletes_data_dict_num = (age,athletes_data_dict)
    data.append(athletes_data_dict_num)
  return data

def main(filename='athletes.csv'):
    df = pd.read_csv(filename, index_col=0)
    df['100m'] = df['100m'].astype(float)
    print(find_fastest_athletes(df,'100m','F',[10,11,12,13,14,15,16]))
    return
Btw, just a small thing: all 7 outputs are now printed on one single line. How can I get them on 7 separate lines when I append the data? I looked all over the Internet for the right command to go to the next line when using "append", but with no success.
Not sure what you mean by a line. You have a data structure. Looks like data is a list. It contains tuples of (age, athlete_data), and athlete_data is a dict of forename,surname,time.

You can print any part of it however you want.

athletes = find_fastest_athletes(df,'100m','F',[10,11,12,13,14,15,16])
for age, athlete_info in athletes:
    print(f"Age:{age} - Info:{athlete_info}")
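
If you want something closer to the dict keyed by age from your first post, a list of (age, info) tuples converts directly with dict(). A small sketch, assuming `athletes` has the shape described above (the two entries here are just stand-ins for the real return value, and `by_age` and `lines` are names I made up):

```python
# Stand-in for the list returned by find_fastest_athletes (two entries only).
athletes = [
    (10, {"forename": "Valerie", "surname": "Lee", "time": 17.23}),
    (11, {"forename": "Aida", "surname": "Aumiller", "time": 15.3}),
]

# A list of (key, value) tuples converts directly to a dict keyed by age.
by_age = dict(athletes)

# Build one line of text per age, then print them on separate lines.
lines = [f"{age}: {info}" for age, info in by_age.items()]
print("\n".join(lines))
```

The list itself has no notion of lines; the line breaks only come from how you print it.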