Python Forum - pandas df inside a df question

hello

here is my initial code

import pandas as pd

# Parse the input CSV file
df = pd.read_csv('employees.csv')

# Keep only the employees who have not taken the training
df = df[df['Training'] == 'No']

I'm trying to understand this part:

df[df['Training'] == 'No']

I understand the inner part:

df['Training']

This returns only the Training column. When I add == 'No' after it, every value is compared to 'No' and the result is a column of Boolean values: each No becomes True, and everything else becomes False.

Output:
0    Yes
1     No
2     No
3     No
4    Yes
5     No
6     No
7    Yes
8     No
9     No
Name: Training, dtype: object

Output:
0    False
1     True
2     True
3     True
4    False
5     True
6     True
7    False
8     True
9     True
Name: Training, dtype: bool
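
For anyone following along without the CSV file, here is a minimal sketch of the same comparison. It assumes a one-column DataFrame built by hand from the Training values shown above; the real employees.csv has more columns.

```python
import pandas as pd

# Training values copied from the output above (assumption: the other
# columns of employees.csv are left out to keep the sketch short).
df = pd.DataFrame(
    {"Training": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No"]}
)

# Comparing a Series to a single value compares every element
# and returns a new Series of True/False values.
mask = df["Training"] == "No"

print(mask.dtype)   # bool
print(mask.sum())   # 7, the number of rows where Training is "No"
```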

But if I put that inside another df[] like this:

df[df['Training'] == 'No']

then the output includes the rest of the columns from the CSV file and looks like this:

Output:
             Name              Department Training           Boss Email
1        John Doe         Human Resources       No  [email protected]
2     James Smith             Engineering       No  [email protected]
3   Jane Anderson             Engineering       No  [email protected]
5  Derrick Wheels  Information Technology       No   [email protected]
6   George Thomas         Human Resources       No  [email protected]
8   Brandon Combs  Information Technology       No   [email protected]
9    Jason Baxter              Management       No   [email protected]

I don't understand how this happens. How does putting all of that inside another df[] filter the original CSV data for rows where Training equals No, and then return those rows together with all the other columns?

Does anyone have a better way of explaining it to me?

Thank you in advance,

mbaker_wv

Read this:

https://pandas.pydata.org/docs/getting_s..._data.html

@deanhystad

So do I understand this now?

df['Training'] == 'No' returns a Series, which is one-dimensional and holds only a single column.

Placing it back inside another df[] returns a DataFrame, which is two-dimensional and has both rows and columns.
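
A quick way to check this is to print the types. The sketch below assumes a small hand-made frame (the names "A", "B", "C" are hypothetical stand-ins for the real data). What df[...] does depends on what goes inside the brackets: a column name selects a column, while a boolean Series selects the rows where it is True.

```python
import pandas as pd

# Hypothetical two-column frame standing in for employees.csv.
df = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Training": ["Yes", "No", "No"]}
)

column = df["Training"]          # a string key selects one column -> Series
mask = df["Training"] == "No"    # the comparison -> boolean Series
rows = df[mask]                  # a boolean Series key selects rows -> DataFrame

print(type(column).__name__)     # Series
print(type(mask).__name__)       # Series
print(type(rows).__name__)       # DataFrame
print(list(rows.index))          # [1, 2]
```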

mbaker_wv

I have a DataFrame

import pandas as pd

df = pd.DataFrame(range(1, 7), columns=["numbers"])

Output:
   numbers
0        1
1        2
2        3
3        4
4        5
5        6

I create a Series (kind of like a one column dataframe, kind of like a list or array). The Series contains True when the corresponding "numbers" is not evenly divisible by 2.

odd_series = df["numbers"] % 2 != 0
print(odd_series)

Output:
0     True
1    False
2     True
3    False
4     True
5    False
Name: numbers, dtype: bool

I use this series to create a new dataframe, selecting only the rows from "df" that are True in "odd_series".

odd_df = df[odd_series]
print(odd_df)

Output:
   numbers
0        1
2        3
4        5

Note that the original DataFrame "df" is unchanged. odd_series is also unchanged.
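
If you want to confirm that for yourself, a short check like the one below should do it. It only reuses the names from the example above.

```python
import pandas as pd

df = pd.DataFrame(range(1, 7), columns=["numbers"])
odd_series = df["numbers"] % 2 != 0
odd_df = df[odd_series]

print(len(df))              # 6, df still has every row
print(len(odd_series))      # 6, the mask has one entry per row of df
print(list(odd_df.index))   # [0, 2, 4], the selected rows keep their original labels
```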

I can condense this:

import pandas as pd

df = pd.DataFrame(range(1, 7), columns=["numbers"])
print(df[df["numbers"] % 2 != 0])
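
As a side note, the same row selection can also be written with .loc, which some people find easier to read because it makes clear that rows are being selected. This is only a sketch of an alternative spelling; the variable names condensed and with_loc are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame(range(1, 7), columns=["numbers"])

condensed = df[df["numbers"] % 2 != 0]       # boolean mask inside df[]
with_loc = df.loc[df["numbers"] % 2 != 0]    # same mask passed to .loc

print(condensed.equals(with_loc))  # True
```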

@deanhystad
Thank you for the output. I think I'm on my way to understanding this better now.

mbaker_wv