Python Forum

I have a pandas DataFrame with six columns, and I would like to iteratively compute the variance of each column over a growing number of rows. Since I am a newbie, I don't really understand the niceties of the language and its common usage patterns. What is the common Python idiom for achieving the following?

vars = []
for i in range(1, 100000):
    v = (data.iloc[range(0, i+1)].var()).values
    if len(vars) == 0:
        vars = v
    else:
        vars = np.vstack((vars, v))

Also, when I run this code, it takes a long time to execute. Can anyone suggest how to improve the running time?

Note that vars is the name of a built-in function (not strictly a reserved word), so it is better not to use it as a variable name...

You can obtain what I think is what you are trying to do with:

pd.DataFrame([data[:k].var() for k in range(2, 100001)])

In a low-level language it might be better to go back to the mathematical definition of variance and update running accumulators for the mean and the variance at each step. In pure Python this is likely to be less efficient, because you would need to access your array element by element.
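For reference, the accumulator idea can be sketched roughly as below. This is only an illustration using Welford's update, which is one common way to do it and not necessarily what you would write in production; the function name running_variance is made up for this example, and it assumes the input is a numeric 2-D array with one row per observation.

```python
import numpy as np

def running_variance(rows):
    """Sample variance (ddof=1) of each column over the first k rows, for every k.

    Hypothetical helper for illustration. Row k-1 of the result holds the
    variance of rows 0..k-1; the first row is NaN because a sample variance
    needs at least two observations.
    """
    rows = np.asarray(rows, dtype=float)
    n_rows, n_cols = rows.shape
    mean = np.zeros(n_cols)  # running mean of each column
    m2 = np.zeros(n_cols)    # running sum of squared deviations from the mean
    out = np.full((n_rows, n_cols), np.nan)
    for k, x in enumerate(rows, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)  # uses the already-updated mean
        if k > 1:
            out[k - 1] = m2 / (k - 1)
    return out
```

Each row is visited once, so the work grows linearly with the number of rows, but the loop itself still runs in the Python interpreter.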

In general, try not to iterate by index in Python (as with your range(1, 100000)), since it is especially inefficient. Most NumPy and pandas operations are "vectorized" to work on a full column, row, or matrix at once. Take a look at the NumPy and pandas documentation.
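As one possible vectorized version, pandas has an expanding window that computes this kind of growing-prefix statistic directly. The sketch below assumes data is a numeric DataFrame; the random six-column frame and its column names are invented here just so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the six-column frame in the question.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 6)), columns=list("abcdef"))

# Row k holds the sample variance of each column over rows 0..k.
# The first row is NaN because a sample variance needs at least two points.
running_var = data.expanding(min_periods=2).var()

# Drop the NaN row to match the loop in the question, which starts at two rows.
result = running_var.iloc[1:].to_numpy()
```

Here result should line up with the stacked array built by the original loop, without slicing the frame again on every iteration.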

vvvcvvcv

killerrex