Python Forum
Looping Through Large Data Sets
#1
Hello all

I have a CSV file with over 60,000 rows of data which I have imported into a dataframe. I want to perform various calculations on this dataframe.

I am looping through the data using a while loop and a counter.

At each iteration of the counter I am performing some arithmetic. The code I have is as follows:-

q = 0
while q < len(df_raw_data):
    df_raw_data.iloc[q,5] = (df_raw_data.iloc[q,3] * 8950) + (df_raw_data.iloc[q,4])
    q = q + 1
This code works and I fully understand it, however it takes ages and I wanted to know if there is a better way of looping through large data sets.

Thank you
Reply
#2
You don't work like this in pandas. Operate on whole columns at once instead of looping row by row.
Your code can be replaced with
df_raw_data[5] = df_raw_data[3] * 8950 + df_raw_data[4]
sample with dummy df:
import pandas as pd 

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
df[3] = df[1]*5 + df[2]
print(df)
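To get a feel for the difference, here is a rough sketch that runs both versions on dummy data and checks they agree. The frame size, the random values and the variable names (df_loop, df_vec) are assumptions for illustration only, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Dummy data (assumed sizes): 5,000 rows, 6 columns labelled 0..5
rng = np.random.default_rng(0)
df_loop = pd.DataFrame(rng.random((5_000, 6)))
df_vec = df_loop.copy()

# Row-by-row version, as in the first post
start = time.perf_counter()
q = 0
while q < len(df_loop):
    df_loop.iloc[q, 5] = (df_loop.iloc[q, 3] * 8950) + df_loop.iloc[q, 4]
    q = q + 1
loop_seconds = time.perf_counter() - start

# Whole-column version: one expression over entire columns
start = time.perf_counter()
df_vec[5] = df_vec[3] * 8950 + df_vec[4]
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, whole-column: {vec_seconds:.4f}s")

# Both approaches should produce the same column 5
same = bool(np.allclose(df_loop[5], df_vec[5]))
print("same result:", same)
```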
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
Hi buran

Thank you for your reply.

I really like your way, however when I attempt this I get an error.

The code I am using is as follows:-

 df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
The error I get is:-

df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
Traceback (most recent call last):

  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)

  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 3


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-5-d11bee5b17a9>", line 1, in <module>
    df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]

  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)

  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err

KeyError: 3
I can make it work using the following code:-

df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]
But I am not sure why I keep getting an error with the first version?
Reply
#4
Can you provide a sample dataframe (no need for 60,000 rows, just 2-3 rows is sufficient)?
Ideally, provide code that creates a DataFrame with the same structure as yours, like I did in my example, which works.

Reply
#5
(Oct-03-2020, 04:06 AM)buran Wrote: can you provide sample dataframe (no need of 60 000 rows, just 2-3 rows is sufficient)
Ideal would be to provide a code that creates a DataFrame with the same structure as yours - like I did in my example which works.

Hi buran

My apologies for not getting back to you sooner.

I have tried several ways to get your method of looping through the data to work, however I still cannot and I don't understand why.

I have re-created the data set along with the code you suggested which is:-

import pandas
 
raw_data = pandas.DataFrame(columns=["Column_0", "Column_1", "Column_2", "Column_3", "Column_4", "Column_5", "Column_6", "Column_7", "Column_8"])

raw_data.loc[0,:] = ("315589", "CHZ3", "1100", "218", "694.63", "nan", "-1.589", "0", "1.3694")

raw_data.loc[1,:] = ("364048", "CHZ3", "1100", "320", "12.09", "nan", "-7.216", "0", "59.89")

raw_data[5] = raw_data[3] * 8950 + raw_data[4]



#raw_data.iloc[:,5] = raw_data.iloc[:,3] * 8950 + raw_data.iloc[:,4]
If you run the code as it is you will see that I get an error if I follow your method, however if I comment out the line you suggested and uncomment the last line it works fine.

But I want to understand why your suggestion does not work?

I think I am going crazy?

Thank you.
Reply
#6
Your dataframe has column names, so you need to use those names inside the square brackets:
raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
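A small sketch of the difference, using made-up two-row frames (the names unnamed and named are just for illustration): square brackets look a column up by its label, while iloc looks it up by position.

```python
import pandas as pd

# No column names given: pandas labels the columns 0, 1, 2
unnamed = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
print(unnamed[1].tolist())  # the label 1 exists here

# Named columns: the labels are now strings, so the label 1 no longer exists
named = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["Column_0", "Column_1", "Column_2"],
)
try:
    named[1]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("named[1] raised KeyError:", lookup_failed)

# Either use the label, or use the position through iloc
by_label = named["Column_1"].tolist()
by_position = named.iloc[:, 1].tolist()
print(by_label, by_position)
```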

Reply
#7
Hi Buran

Thank you for your response.

I think I understand: if no column names are specified in the dataframe, then the values in the square brackets are the column index numbers.

If column names are specified, then the name of the column needs to be included inside the square brackets - would this be correct?

I personally prefer to use row and column index numbers through the iloc indexer:-
df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]
But I guess both achieve the same thing.

The change you suggested, which was to use the column names, runs without an error:-
raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
It runs, however the arithmetic is wrong. For example, the values in Column_5 look like 218218218218..., so the code just repeats the value from Column_3 8950 times???

I am not certain why this is?
Reply
#8
In your sample dataframe all values are strings, e.g. "218" vs 218, and multiplying a string by an integer repeats the string. Both versions of the code (column names and iloc) yield the same result.
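As a hedged sketch of one way around it, assuming the columns really should be numeric: convert them first (here with pd.to_numeric) and then do the arithmetic. The two-column frame below is a cut-down stand-in for the sample data, and the names raw and numeric are made up for the example.

```python
import pandas as pd

# Cut-down stand-in for the sample data; values stored as Python strings
raw = pd.DataFrame(
    {"Column_3": ["218", "320"], "Column_4": ["694.63", "12.09"]},
    dtype=object,
)

# Multiplying a string by an integer repeats it
repeated = raw["Column_3"] * 3
print(repeated.tolist())  # ['218218218', '320320320']

# Convert every column to a numeric dtype, then the arithmetic behaves
numeric = raw.apply(pd.to_numeric)
numeric["Column_5"] = numeric["Column_3"] * 8950 + numeric["Column_4"]
print(numeric)
```

If the data comes from a CSV file, it is worth checking df.dtypes straight after loading to see whether the columns came in as numbers or as text.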

Reply
#9
(Oct-17-2020, 02:44 PM)buran Wrote: In your sample dataframe all values are strings, e.g. "218" vs 218. Both code yield the same result.

You're right.

Thank you Buran, I really appreciate your help.
Reply
#10
I'd be interested as to whether there is a difference in performance. Do you gain or lose speed by specifying df.iloc[:,col] vs df[col]? With a 60K row dataframe the difference might be noticeable if there is one.
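One way to find out would be to time both forms with timeit. This is only a sketch on random dummy data of roughly the size mentioned in the thread; the frame contents and the repeat count of 200 are assumptions, and no particular outcome is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Dummy frame: 60,000 rows, 6 columns labelled 0..5 (assumed shape)
df = pd.DataFrame(np.random.default_rng(0).random((60_000, 6)))

# Same calculation, selecting columns by label and by position
by_label = timeit.timeit(lambda: df[3] * 8950 + df[4], number=200)
by_position = timeit.timeit(
    lambda: df.iloc[:, 3] * 8950 + df.iloc[:, 4], number=200
)

print(f"df[col]:         {by_label:.4f}s for 200 runs")
print(f"df.iloc[:, col]: {by_position:.4f}s for 200 runs")

# Sanity check that both selections compute the same values
identical = (df[3] * 8950 + df[4]).equals(df.iloc[:, 3] * 8950 + df.iloc[:, 4])
print("identical results:", identical)
```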
Reply

