Simple pandas question

mcva · Nov-25-2021, 12:26 PM

I´m using the following code to estimate the moving average for each column (year X) of a dataframe considering a window (pandas.DataFrame.rolling first parameter) that varies from 1 to 100.
The code is fine but when I need to increase the number of columns in the dataframe I have to include a lot of new lines, hence the probability of making a mistake increases considerably. I have tried to solve this using a numpy array but I came across a very different problem that i was not able to solve.

Can this code be improved to handle the increase of the dataframe rows?
Thank you

# Import data
data = pd.read_excel('moving_average.xlsx', index_col=0, header=0)
data.columns = ['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']


# Add filtered data
tmp1 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(1, center=False, axis=0).mean()
tmp1.columns = ['year1Filter_1', 'year2Filter_1', 'year3Filter_1','year4Filter_1','year5Filter_1','year6Filter_1','year7Filter_1','year8Filter_1','year9Filter_1']

tmp2 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(2, center=False, axis=0).mean()
tmp2.columns = ['year1Filter_2', 'year2Filter_2', 'year3Filter_2','year4Filter_2','year5Filter_2','year6Filter_2','year7Filter_2','year8Filter_2','year9Filter_2']

tmp3 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(3, center=False, axis=0).mean()
tmp3.columns = ['year1Filter_3', 'year2Filter_3', 'year3Filter_3','year4Filter_3','year5Filter_3','year6Filter_3','year7Filter_3','year8Filter_3','year9Filter3']

....
tmp100 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(100, center=False, axis=0).mean()
tmp3.columns = ['year1Filter_100', 'year2Filter_100', 'year3Filter_100','year4Filter_100','year5Filter_100','year6Filter_100','year7Filter_100','year8Filter_100','year9Filter100']

data = pd.concat((data, tmp1, tmp2,tmp3,...tmp100), axis=1)

sel1 = data.columns.map(lambda x: bool(re.search('year1',x)))
out1 = data[data.columns[sel1]]

sel2 = data.columns.map(lambda x: bool(re.search('year2',x)))
out2 = data[data.columns[sel2]]

sel3 = data.columns.map(lambda x: bool(re.search('year3',x)))
out3 = data[data.columns[sel3]]
...
sel9 = data.columns.map(lambda x: bool(re.search('year9',x)))
out3 = data[data.columns[sel3]]

#Export results to Excel
writer = pd.ExcelWriter('year1.xlsx')
out1.to_excel(writer,'data')
writer.save()
...
writer = pd.ExcelWriter('year9.xlsx')
out9.to_excel(writer,'data')
writer.save()

jefsummers · Nov-30-2021, 03:13 AM

Yes, a bit messy and I am sure can be simplified, but it is unclear to me what you want to do. Can you explain better (words, I see the code)?

mcva · Nov-30-2021, 02:15 PM

(Nov-30-2021, 03:13 AM)jefsummers Wrote: Yes, a bit messy and I am sure can be simplified, but it is unclear to me what you want to do. Can you explain better (words, I see the code)?

I have a dataframe with 1000 columns all with the same length (365 values), and I need to calculate de moving average of each column. But I need to do this for diferent periods, from 1 to 100. A period is the pandas.DataFrame.rolling function first parameter.
So for each of the 1000 columns (years) I need to get 100 new datasets (They will have a diferent length with the increase of the moving average period, this is the reason why I can´t use a numpy array instead of a dataframe)

In my code I´m doing the following:

1) tm1, tm2 and tm3 are the moving averages considering a period of 1, 2 and 3 for each of the nine columns. I need to do this until I reach tm100...
2) Then I´m using concat() to create a single dataframe.
3) I´m using re.search() to reorganize the columns by year.
4) Export to excel.

Thank you

jefsummers · Dec-05-2021, 01:51 PM

Use a list of dataframes, tm[0..100] rather than 100 variables. Will be way easier to maintain. You can still use all of your other methods, but put the process into loop or list comprehensions - will be shorter, easier to maintain, and faster.

mcva · Dec-17-2021, 04:47 PM

This is a good solution...

import pandas as pd
import numpy as np

# Import data
data = pd.read_excel('moving_average.xlsx', index_col=0, header=0)

results = []
for i in range(100):
    tmp = data.rolling(i+1, center=False, axis=0).mean()
    results.append(tmp)

df=pd.concat(results, axis=1)

df.to_excel("output.xlsx")

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	pandas df inside a df question	mbaker_wv	4	1,180	Dec-25-2022, 01:11 AM Last Post: mbaker_wv
	Pandas usecols question	rsearing	1	1,240	Aug-20-2022, 10:10 PM Last Post: jefsummers
	Simple question (I guess)	tchadrack	1	1,315	Jan-08-2022, 06:36 AM Last Post: buran
	Pandas question	new2datasci	0	1,943	Jan-10-2021, 01:29 AM Last Post: new2datasci
	Pandas merge question	smw10c	2	5,710	Jul-02-2020, 06:56 PM Last Post: hussainmujtaba
	Counting Criteria in Pandas Question	Koenig	1	2,157	Sep-30-2019, 05:16 AM Last Post: perfringo
	Simple String to Time within a pandas dataframe	Ecniv	1	2,507	Jun-14-2019, 03:25 AM Last Post: scidam
	Function question using Pandas	smw10c	7	7,077	Feb-12-2019, 06:52 PM Last Post: Nathandsn
	Simple pandas dataframe question	popohoma	1	3,540	Jan-03-2019, 05:00 PM Last Post: ashlardev
	question on pandas datareader	kit12_31	3	9,212	Feb-05-2018, 11:55 PM Last Post: snippsat

Simple pandas question

User Panel Messages

Announcements