Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Simple pandas question
#1
I´m using the following code to estimate the moving average for each column (year X) of a dataframe considering a window (pandas.DataFrame.rolling first parameter) that varies from 1 to 100.
The code is fine but when I need to increase the number of columns in the dataframe I have to include a lot of new lines, hence the probability of making a mistake increases considerably. I have tried to solve this using a numpy array but I came across a very different problem that i was not able to solve.

Can this code be improved to handle the increase of the dataframe rows?
Thank you


# Import data
data = pd.read_excel('moving_average.xlsx', index_col=0, header=0)
data.columns = ['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']


# Add filtered data
tmp1 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(1, center=False, axis=0).mean()
tmp1.columns = ['year1Filter_1', 'year2Filter_1', 'year3Filter_1','year4Filter_1','year5Filter_1','year6Filter_1','year7Filter_1','year8Filter_1','year9Filter_1']

tmp2 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(2, center=False, axis=0).mean()
tmp2.columns = ['year1Filter_2', 'year2Filter_2', 'year3Filter_2','year4Filter_2','year5Filter_2','year6Filter_2','year7Filter_2','year8Filter_2','year9Filter_2']

tmp3 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(3, center=False, axis=0).mean()
tmp3.columns = ['year1Filter_3', 'year2Filter_3', 'year3Filter_3','year4Filter_3','year5Filter_3','year6Filter_3','year7Filter_3','year8Filter_3','year9Filter3']

....
tmp100 = data.loc[:,['year1','year2','year3','year4', 'year5','year6','year7', 'year8', 'year9']].rolling(100, center=False, axis=0).mean()
tmp3.columns = ['year1Filter_100', 'year2Filter_100', 'year3Filter_100','year4Filter_100','year5Filter_100','year6Filter_100','year7Filter_100','year8Filter_100','year9Filter100']

data = pd.concat((data, tmp1, tmp2,tmp3,...tmp100), axis=1)

sel1 = data.columns.map(lambda x: bool(re.search('year1',x)))
out1 = data[data.columns[sel1]]

sel2 = data.columns.map(lambda x: bool(re.search('year2',x)))
out2 = data[data.columns[sel2]]

sel3 = data.columns.map(lambda x: bool(re.search('year3',x)))
out3 = data[data.columns[sel3]]
...
sel9 = data.columns.map(lambda x: bool(re.search('year9',x)))
out3 = data[data.columns[sel3]]

#Export results to Excel
writer = pd.ExcelWriter('year1.xlsx')
out1.to_excel(writer,'data')
writer.save()
...
writer = pd.ExcelWriter('year9.xlsx')
out9.to_excel(writer,'data')
writer.save()
likes this post
Reply
#2
Yes, a bit messy and I am sure can be simplified, but it is unclear to me what you want to do. Can you explain better (words, I see the code)?
mcva likes this post
Reply
#3
(Nov-30-2021, 03:13 AM)jefsummers Wrote: Yes, a bit messy and I am sure can be simplified, but it is unclear to me what you want to do. Can you explain better (words, I see the code)?

I have a dataframe with 1000 columns all with the same length (365 values), and I need to calculate de moving average of each column. But I need to do this for diferent periods, from 1 to 100. A period is the pandas.DataFrame.rolling function first parameter.
So for each of the 1000 columns (years) I need to get 100 new datasets (They will have a diferent length with the increase of the moving average period, this is the reason why I can´t use a numpy array instead of a dataframe)


In my code I´m doing the following:

1) tm1, tm2 and tm3 are the moving averages considering a period of 1, 2 and 3 for each of the nine columns. I need to do this until I reach tm100...
2) Then I´m using concat() to create a single dataframe.
3) I´m using re.search() to reorganize the columns by year.
4) Export to excel.


Thank you
Reply
#4
Use a list of dataframes, tm[0..100] rather than 100 variables. Will be way easier to maintain. You can still use all of your other methods, but put the process into loop or list comprehensions - will be shorter, easier to maintain, and faster.
mcva likes this post
Reply
#5
This is a good solution...

import pandas as pd
import numpy as np

# Import data
data = pd.read_excel('moving_average.xlsx', index_col=0, header=0)

results = []
for i in range(100):
    tmp = data.rolling(i+1, center=False, axis=0).mean()
    results.append(tmp)

df=pd.concat(results, axis=1)

df.to_excel("output.xlsx")
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  pandas df inside a df question mbaker_wv 4 1,180 Dec-25-2022, 01:11 AM
Last Post: mbaker_wv
  Pandas usecols question rsearing 1 1,240 Aug-20-2022, 10:10 PM
Last Post: jefsummers
  Simple question (I guess) tchadrack 1 1,315 Jan-08-2022, 06:36 AM
Last Post: buran
  Pandas question new2datasci 0 1,943 Jan-10-2021, 01:29 AM
Last Post: new2datasci
  Pandas merge question smw10c 2 5,710 Jul-02-2020, 06:56 PM
Last Post: hussainmujtaba
  Counting Criteria in Pandas Question Koenig 1 2,157 Sep-30-2019, 05:16 AM
Last Post: perfringo
  Simple String to Time within a pandas dataframe Ecniv 1 2,507 Jun-14-2019, 03:25 AM
Last Post: scidam
  Function question using Pandas smw10c 7 7,077 Feb-12-2019, 06:52 PM
Last Post: Nathandsn
  Simple pandas dataframe question popohoma 1 3,540 Jan-03-2019, 05:00 PM
Last Post: ashlardev
  question on pandas datareader kit12_31 3 9,212 Feb-05-2018, 11:55 PM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020