Looping Through Large Data Sets

JoeDainton123 · Oct-02-2020, 06:59 PM

Hello all

I have a CSV file with over 60,000 rows of data which I have imported into a dataframe, and I want to perform various calculations on it.

I am looping through the data using a while loop and a counter.

At each iteration I perform some arithmetic. The code I have is as follows:

q = 0
while q < len(df_raw_data):
    df_raw_data.iloc[q,5] = (df_raw_data.iloc[q,3] * 8950) + (df_raw_data.iloc[q,4])
    q = q + 1

This code works and I fully understand it, but it takes ages. Is there a better way of looping through large data sets?

Thank you

**buran** · (This post was last modified: Oct-02-2020, 07:08 PM by buran.)

You don't work like this in pandas; operate on whole columns instead of one row at a time.
Your code can be replaced with:

df_raw_data[5] = df_raw_data[3] * 8950 + df_raw_data[4]

sample with dummy df:

import pandas as pd 

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
df[3] = df[1]*5 + df[2]
print(df)
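To check that the whole-column expression gives the same numbers as the original while loop, a minimal sketch along these lines may help. The frame is made-up dummy data with default integer column labels 0 to 5 (an assumption; the real CSV may have named columns), and the 8950 factor is taken from the first post.

```python
import pandas as pd

# Dummy frame with default integer column labels 0..5 (assumed layout).
df_loop = pd.DataFrame(
    [[0, 0, 0, 2, 1.5, 0.0],
     [0, 0, 0, 3, 2.5, 0.0]]
)
df_vec = df_loop.copy()

# Original approach: one row at a time.
q = 0
while q < len(df_loop):
    df_loop.iloc[q, 5] = df_loop.iloc[q, 3] * 8950 + df_loop.iloc[q, 4]
    q = q + 1

# Vectorized approach: whole columns at once.
df_vec[5] = df_vec[3] * 8950 + df_vec[4]

# Compare the two result columns.
print(df_loop[5].equals(df_vec[5]))
```

Both versions compute the same column; the vectorized one avoids a Python-level loop over every row.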

JoeDainton123 · (This post was last modified: Oct-02-2020, 09:28 PM by JoeDainton123.)

Hi buran

Thank you for your reply.

I really like your approach; however, when I attempt this I get an error.

The code I am using is as follows:

df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]

The error I get is:

df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
Traceback (most recent call last):

  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)

  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 3


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-5-d11bee5b17a9>", line 1, in <module>
    df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]

  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)

  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err

KeyError: 3

I can make it work using the following code:

df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]

But I am not sure why I keep getting an error with the first version.

**buran** · Oct-03-2020, 04:06 AM

Can you provide a sample dataframe? There is no need for 60,000 rows; 2-3 rows are sufficient.
Ideally, provide code that creates a DataFrame with the same structure as yours, as I did in my example, which works.

JoeDainton123 · Oct-16-2020, 09:41 PM

Hi buran

My apologies for not getting back to you sooner.

I have tried several ways to get your method to work, but I still cannot, and I don't understand why.

I have re-created the data set along with the code you suggested:

import pandas
 
raw_data = pandas.DataFrame(columns=["Column_0", "Column_1", "Column_2", "Column_3", "Column_4", "Column_5", "Column_6", "Column_7", "Column_8"])

raw_data.loc[0,:] = ("315589", "CHZ3", "1100", "218", "694.63", "nan", "-1.589", "0", "1.3694")

raw_data.loc[1,:] = ("364048", "CHZ3", "1100", "320", "12.09", "nan", "-7.216", "0", "59.89")

raw_data[5] = raw_data[3] * 8950 + raw_data[4]



#raw_data.iloc[:,5] = raw_data.iloc[:,3] * 8950 + raw_data.iloc[:,4]

If you run the code as it is, you will see that I get an error when I follow your method. However, if I comment out the line you suggested and uncomment the last line, it works fine.

I want to understand why your suggestion does not work.

I think I am going crazy.

Thank you.

**buran** · Oct-17-2020, 06:27 AM

Your dataframe has column names, so you need to use them:

raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
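The difference between the two kinds of lookup can be sketched as follows. Square brackets on a DataFrame look a column up by its label, while `.iloc` selects by position. The frame below reuses the `Column_N` names from the sample, but with numeric values and only six columns, which is an assumption made for illustration.

```python
import pandas as pd

# Six named columns, numeric values (assumed; the sample above uses strings).
columns = [f"Column_{i}" for i in range(6)]
raw_data = pd.DataFrame(
    [[315589, 0, 1100, 218, 694.63, 0.0],
     [364048, 0, 1100, 320, 12.09, 0.0]],
    columns=columns,
)

# The labels that raw_data[...] will accept.
print(raw_data.columns.tolist())

try:
    raw_data[3]  # label lookup: no column is labelled 3
except KeyError as err:
    print("KeyError:", err)

# Positional access works regardless of the labels.
by_position = raw_data.iloc[:, 3] * 8950 + raw_data.iloc[:, 4]

# Label access works once the actual label is used.
by_label = raw_data["Column_3"] * 8950 + raw_data["Column_4"]

print(by_position.equals(by_label))
```

In buran's dummy example no column names were given, so pandas assigned the integer labels 0, 1, 2, which is why `df[1]` worked there.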

JoeDainton123 · Oct-17-2020, 11:13 AM

Hi Buran

Thank you for your response.

I think I understand: if no column names are specified in the dataframe, then the values in the square brackets are the column index numbers.

If column names are specified, then the name of the column needs to go inside the square brackets. Would this be correct?

I personally prefer to use row and column index numbers through iloc:

df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]

But I guess both achieve the same thing.

The change you suggested, which was to use the column names, runs without an error:

raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']

However, the arithmetic does not work. For example, the values in Column_5 look like 2182181218218...; the code just repeats the value from Column_3 8950 times.

I am not certain why this is.

**buran** · Oct-17-2020, 02:44 PM

In your sample dataframe all values are strings, e.g. "218" rather than 218, so * repeats the string and + concatenates. Both versions of the code (column names and iloc) yield the same result.
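A small sketch of the string behaviour and one way around it, assuming the columns really do hold numeric text. `pd.to_numeric` is used here as one option for the conversion; whether the real CSV has this problem depends on how it was read, and `df_raw_data.dtypes` would show it.

```python
import pandas as pd

# Same kind of data as the sample: numbers stored as strings.
raw_data = pd.DataFrame(
    {"Column_3": ["218", "320"], "Column_4": ["694.63", "12.09"]}
)

# Plain Python shows the effect: * repeats a string, + concatenates.
print("218" * 3)          # 218218218
print("218" + "694.63")   # 218694.63

# Convert the columns to numbers before doing arithmetic.
col_3 = pd.to_numeric(raw_data["Column_3"])
col_4 = pd.to_numeric(raw_data["Column_4"])
raw_data["Column_5"] = col_3 * 8950 + col_4

print(raw_data["Column_5"])
```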

JoeDainton123 · Oct-17-2020, 07:53 PM

You're right.

Thanks buran, I really appreciate your help.

jefsummers · Oct-18-2020, 12:53 PM

I'd be interested to know whether there is a difference in performance. Do you gain or lose speed by specifying df.iloc[:, col] vs df[col]? With a 60K-row dataframe the difference might be noticeable, if there is one.
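One way to find out would be to time both forms with `timeit`. This is only a sketch on a made-up 60,000-row frame with default integer column labels (an assumption), and the numbers will vary with the machine and pandas version.

```python
import timeit

import pandas as pd

# Made-up 60,000-row frame with default integer column labels 0..5.
n_rows = 60_000
df = pd.DataFrame({i: range(n_rows) for i in range(6)})

def by_label():
    # Column lookup by label.
    return df[3] * 8950 + df[4]

def by_position():
    # Column lookup by position.
    return df.iloc[:, 3] * 8950 + df.iloc[:, 4]

# Total time for 100 runs of each form.
label_time = timeit.timeit(by_label, number=100)
position_time = timeit.timeit(by_position, number=100)

print(f"df[col]:         {label_time:.4f} s for 100 runs")
print(f"df.iloc[:, col]: {position_time:.4f} s for 100 runs")
```

Both forms do the same column arithmetic, so any gap comes from the lookup step rather than the calculation.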
