Python Forum

Dear Python Experts,
I am looking at Daniel Breen's project* about housing prices and GDP in the US.
I noticed that about halfway through, at In [37], he does the following:

economy_df['Next Quarter GDP'] = list(economy_df['GDP (billions)'].iloc[1:]) + [np.NAN]
economy_df['Two Quarters GDP'] = list(economy_df['GDP (billions)'].iloc[2:]) + 2*[np.NAN]

It seems that the first line goes down one row and takes the value and puts it in a column while
the second line takes the value from two rows down and puts it in a column.
Code-wise, I don't really understand it. Why the list? What does the np.NAN do?
When I take the list(...) away the whole thing does not go down 1 or 2 rows anymore.
Is there another way to achieve the same functionality?

Many thanks for any ideas and a great weekend.

*http://danielbreen.net/projects/housing_prices_college_towns/

He wants to shift/lag GDP so that each row holds both the current value and the value from the next record.

So he takes df['GDP'] and uses iloc to drop the first value. He can't assign the result directly as a new column (well, he can, but it won't do what he wants): the sliced Series keeps the same index labels as df, so direct assignment would align on those labels and put each value back on its original row, leaving NaN in the first row.

That's why he "removes" the index by converting to a list and pads it with np.NaN to the same length as the original df. After that he can assign it as a new column. When you remove list(), the + is no longer list concatenation: adding a pd.Series and [np.NaN] is element-wise addition, so np.NaN is added to each value and every value in the resulting pd.Series becomes NaN.
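To make the index alignment concrete, here is a small sketch on a toy frame (the data is the same as in the example further down; the column names "aligned" and "by_position" are made up for illustration). It uses np.nan, the lowercase spelling, since that one works across NumPy versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Quarter": ["1Q1", "1Q2", "1Q3"], "GDP": [132, 136, 140]})

# The slice keeps its original index labels (1 and 2).
tail = df["GDP"].iloc[1:]

# Assigning a Series aligns on index labels: each value lands back on its
# own row, and row 0 (no matching label) is filled with NaN.
df["aligned"] = tail

# A plain list has no index, so values are assigned by position instead.
# The list must have the same length as df, hence the NaN padding.
df["by_position"] = list(tail) + [np.nan]

print(df)
```

The "aligned" column is just GDP with a hole in the first row, while "by_position" is the shifted column he was after.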

And yes, this is unnecessarily complicated. Shifting/lagging is very common, so pandas provides the shift() method, which does it directly.

Example dataframe:

In [43]: data = {'Quarter': ["1Q1", "1Q2", "1Q3"], "GDP": [132, 136, 140]}

In [44]: df = pd.DataFrame(data, columns=["Quarter", "GDP"])

In [45]: df
Out[45]: 
  Quarter  GDP
0     1Q1  132
1     1Q2  136
2     1Q3  140

His way:

In [46]: df.GDP.iloc[1:]
Out[46]: 
1    136
2    140
Name: GDP, dtype: int64

In [47]: df.GDP.iloc[1:] + [np.NaN] # does not work
Out[47]: 
1   NaN
2   NaN
Name: GDP, dtype: float64

In [48]: list(df.GDP.iloc[1:]) + [np.NaN]
Out[48]: [136, 140, nan]

In [49]: df['Next Quarter GDP'] = list(df.GDP.iloc[1:]) + [np.NaN]

In [50]: df
Out[50]: 
  Quarter  GDP  Next Quarter GDP
0     1Q1  132             136.0
1     1Q2  136             140.0
2     1Q3  140               NaN

You can see that the index values were preserved in output [46]. And while the GDP column had int64 dtype, Next Quarter GDP was implicitly promoted to float64 (an int64 column cannot hold NaN).
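If the float promotion bothers you, one option in newer pandas versions is the nullable "Int64" extension dtype (capital I), which stores missing values as pd.NA. This is only a sketch of that idea, not something the original project does:

```python
import pandas as pd

df = pd.DataFrame({"Quarter": ["1Q1", "1Q2", "1Q3"], "GDP": [132, 136, 140]})

# Plain int64 is promoted to float64 once a NaN appears.
as_float = df["GDP"].shift(-1)
print(as_float.dtype)

# Converting to the nullable Int64 dtype first keeps integers after the
# shift; the missing last value is stored as pd.NA.
as_nullable_int = df["GDP"].astype("Int64").shift(-1)
print(as_nullable_int.dtype)
print(as_nullable_int)
```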

Simpler way:

In [51]: df['Next Quarter GDP - shift'] = df.GDP.shift(-1)

In [52]: df
Out[52]: 
  Quarter  GDP  Next Quarter GDP  Next Quarter GDP - shift
0     1Q1  132             136.0                     136.0
1     1Q2  136             140.0                     140.0
2     1Q3  140               NaN                       NaN
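The two-quarter column from the question works the same way with shift(-2). A quick sketch comparing it with the list-based version (the fourth row, 1Q4 with GDP 145, is invented here just to have enough rows):

```python
import numpy as np
import pandas as pd

# Toy data; the 1Q4 row is hypothetical.
df = pd.DataFrame({"Quarter": ["1Q1", "1Q2", "1Q3", "1Q4"],
                   "GDP": [132, 136, 140, 145]})

# List-based version from the question: drop two values, pad with two NaNs.
manual = pd.Series(list(df["GDP"].iloc[2:]) + 2 * [np.nan], index=df.index)

# shift(-2) moves every value up two rows and fills the tail with NaN.
shifted = df["GDP"].shift(-2)
df["Two Quarters GDP"] = shifted

print(df)
```

A negative argument pulls later rows upward (a lead), while a positive one pushes values down (a lag).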

Incredible! You are the best, zivoni!
I had no clue about .shift(), or that this operation is called shifting or lagging.

metalray

zivoni

metalray