Python Forum
Looping Through Large Data Sets - Printable Version



Looping Through Large Data Sets - JoeDainton123 - Oct-02-2020

Hello all

I have a CSV file with over 60,000 rows of data which I have imported into a dataframe, and I want to perform various calculations on it.

I am looping through the data using a while loop and a counter.

At each iteration of the counter I perform some arithmetic; the code I have is as follows:

q = 0
while q < len(df_raw_data):
    # one .iloc lookup and assignment per row - very slow on a 60,000-row frame
    df_raw_data.iloc[q,5] = (df_raw_data.iloc[q,3] * 8950) + (df_raw_data.iloc[q,4])
    q = q + 1
This code works and I fully understand it; however, it takes ages, and I wanted to know if there is a better way of looping through large data sets.

Thank you


RE: Looping Through Large Data Sets - buran - Oct-02-2020

You don't work like this in pandas - column-wise operations are vectorised, so they run far faster than a Python-level loop over the rows.
Your code can be replaced with
df_raw_data[5] = df_raw_data[3] * 8950 + df_raw_data[4]
Sample with a dummy df:
import pandas as pd 

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
df[3] = df[1]*5 + df[2]
print(df)



RE: Looping Through Large Data Sets - JoeDainton123 - Oct-02-2020

Hi buran

Thank you for your reply.

I really like your way; however, when I attempt it I get an error.

The code I am using is as follows:

df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
The error I get is:

df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
Traceback (most recent call last):

  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)

  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 3


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-5-d11bee5b17a9>", line 1, in <module>
    df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]

  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)

  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err

KeyError: 3
I can make it work using the following code:

df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]
But I am not sure why I keep getting the error.


RE: Looping Through Large Data Sets - buran - Oct-03-2020

Can you provide a sample dataframe? No need for 60,000 rows; 2-3 rows is sufficient.
Ideally, provide code that creates a DataFrame with the same structure as yours - like I did in my example, which works.


RE: Looping Through Large Data Sets - JoeDainton123 - Oct-16-2020

(Oct-03-2020, 04:06 AM)buran Wrote: Can you provide a sample dataframe? No need for 60,000 rows; 2-3 rows is sufficient.
Ideally, provide code that creates a DataFrame with the same structure as yours - like I did in my example, which works.

Hi buran

My apologies for not getting back to you sooner.

I have tried several ways to get your method of looping through the data to work; however, I still cannot get it to work, and I don't understand why.

I have re-created the data set along with the code you suggested, which is:

import pandas
 
raw_data = pandas.DataFrame(columns=["Column_0", "Column_1", "Column_2", "Column_3", "Column_4", "Column_5", "Column_6", "Column_7", "Column_8"])

raw_data.loc[0,:] = ("315589", "CHZ3", "1100", "218", "694.63", "nan", "-1.589", "0", "1.3694")

raw_data.loc[1,:] = ("364048", "CHZ3", "1100", "320", "12.09", "nan", "-7.216", "0", "59.89")

raw_data[5] = raw_data[3] * 8950 + raw_data[4]  # this line raises KeyError: 3



#raw_data.iloc[:,5] = raw_data.iloc[:,3] * 8950 + raw_data.iloc[:,4]
If you run the code as it is, you will see that I get an error when I follow your method; however, if I comment out the line you suggested and uncomment the last line, it works fine.

But I want to understand why your suggestion does not work.

I think I am going crazy!

Thank you.


RE: Looping Through Large Data Sets - buran - Oct-17-2020

Your dataframe has column names, so
raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
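If you want to keep using positional numbers even when the columns are named, one option (a quick sketch with a dummy numeric frame, not your actual data) is to look the label up via df.columns:
import pandas as pd

df = pd.DataFrame([[218, 694.63], [320, 12.09]], columns=["Column_3", "Column_4"])
# df.columns[i] returns the label at position i, so the numbers stay positional
df["Column_5"] = df[df.columns[0]] * 8950 + df[df.columns[1]]
print(df)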



RE: Looping Through Large Data Sets - JoeDainton123 - Oct-17-2020

Hi Buran

Thank you for your response.

I think I understand: if no column names are specified, pandas assigns default integer labels (0, 1, 2, ...), so the values in the square brackets behave like column index numbers.

If column names are specified, then the name of the column needs to go inside the square brackets - would this be correct?

I personally prefer to use row and column index numbers through the iloc command:
df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]
But I guess both achieve the same thing.
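A quick sketch with a dummy frame seems to confirm this (the frame here is made up, just to illustrate):
import pandas as pd

# default frame: the column labels are the integers 0, 1, 2
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
print(df[1])          # label-based lookup; the label happens to be the integer 1
print(df.iloc[:, 1])  # positional lookup; the same column in this case

# named frame: the labels are now strings, so df[1] raises a KeyError
df.columns = ["a", "b", "c"]
print(df["b"])        # label-based lookup by name
print(df.iloc[:, 1])  # positional lookup still works unchanged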

Although the change you suggested, which was to use the column names, runs without error:
raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
This runs, but the arithmetic does not work: the value in Column_5 comes out as 218218218..., as if the code just copied the value from Column_3 8950 times???

I am not certain why this is?


RE: Looping Through Large Data Sets - buran - Oct-17-2020

In your sample dataframe all the values are strings, e.g. "218" vs 218. Multiplying a string by an integer repeats the string, so both versions of the code yield the same (wrong) result.
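A quick sketch of what happens, and one way to fix it by converting the columns before the arithmetic (pd.to_numeric is one option; astype(float) would also work):
import pandas as pd

print("218" * 3)  # '218218218' - Python repeats the string

df = pd.DataFrame({"Column_3": ["218", "320"], "Column_4": ["694.63", "12.09"]})
# convert the string columns to numbers, then the arithmetic behaves as expected
df["Column_3"] = pd.to_numeric(df["Column_3"])
df["Column_4"] = pd.to_numeric(df["Column_4"])
df["Column_5"] = df["Column_3"] * 8950 + df["Column_4"]
print(df)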


RE: Looping Through Large Data Sets - JoeDainton123 - Oct-17-2020

(Oct-17-2020, 02:44 PM)buran Wrote: In your sample dataframe all the values are strings, e.g. "218" vs 218. Multiplying a string by an integer repeats the string, so both versions of the code yield the same (wrong) result.

You're right.

Thank you, buran - I really appreciate your help.


RE: Looping Through Large Data Sets - jefsummers - Oct-18-2020

I'd be interested in whether there is a difference in performance. Do you gain or lose speed with df.iloc[:, col] versus df[col]? With a 60K-row dataframe the difference might be noticeable, if there is one.
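A rough way to check, sketched with a dummy frame and timeit (the numbers will vary by machine and pandas version):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(60_000, 6))

# the same computation via label-based and positional column access
by_label = lambda: df[3] * 8950 + df[4]
by_iloc = lambda: df.iloc[:, 3] * 8950 + df.iloc[:, 4]

print("df[col]:      ", timeit.timeit(by_label, number=1000))
print("df.iloc[:,col]:", timeit.timeit(by_iloc, number=1000))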