Posts: 61
Threads: 43
Joined: Jul 2020
Hello all
I have a CSV file with over 60,000 rows of data which I have imported into a dataframe, and I want to perform various calculations on this dataframe.
I am looping through the data using a while loop and a counter.
At each iteration of the counter I am performing some arithmetic. The code I have is as follows:
q = 0
while q < len(df_raw_data):
    df_raw_data.iloc[q, 5] = (df_raw_data.iloc[q, 3] * 8950) + (df_raw_data.iloc[q, 4])
    q = q + 1
This code works and I fully understand it, however it takes ages and I wanted to know if there is a better way of looping through large data sets.
Thank you
Posts: 8,160
Threads: 160
Joined: Sep 2016
Oct-02-2020, 07:07 PM
(This post was last modified: Oct-02-2020, 07:08 PM by buran.)
You don't work like this in pandas.
Your code can be replaced with
df_raw_data[5] = df_raw_data[3] * 8950 + df_raw_data[4]
Sample with dummy df:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
df[3] = df[1]*5 + df[2]
print(df)
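To check that the column-wise expression really does the same job as the while loop, here is a minimal sketch that runs both on a small made-up frame and compares the results. The frame contents and the names df_loop and df_vec are assumptions for illustration, not the original data.

```python
import pandas as pd

# Hypothetical stand-in for df_raw_data: six columns with the default
# integer labels 0..5 and made-up numeric values.
df_loop = pd.DataFrame(
    [[0, 0, 0, 2, 10, 0],
     [0, 0, 0, 3, 20, 0],
     [0, 0, 0, 4, 30, 0]]
)
df_vec = df_loop.copy()

# Row-by-row version, as in the first post.
q = 0
while q < len(df_loop):
    df_loop.iloc[q, 5] = (df_loop.iloc[q, 3] * 8950) + df_loop.iloc[q, 4]
    q = q + 1

# Column-wise version: one expression over whole columns, no Python loop.
df_vec[5] = df_vec[3] * 8950 + df_vec[4]

print(df_loop[5].tolist())
print(df_vec[5].tolist())
```

Both print the same list. The column-wise form lets pandas do the arithmetic for all rows in one call instead of one Python-level assignment per row.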
Posts: 61
Threads: 43
Joined: Jul 2020
Oct-02-2020, 09:27 PM
(This post was last modified: Oct-02-2020, 09:28 PM by JoeDainton123.)
Hi buran
Thank you for your reply.
I really like your way, however when I attempt this I get an error.
The code I am using is as follows:
df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
The error I get is:
df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 3
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-5-d11bee5b17a9>", line 1, in <module>
df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 3
I can make it work using the following code:
df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]
But I am not sure why I keep getting an error?
Posts: 8,160
Threads: 160
Joined: Sep 2016
Can you provide a sample dataframe? There is no need for 60,000 rows; 2-3 rows are sufficient.
Ideally, provide code that creates a DataFrame with the same structure as yours, like I did in my example, which works.
Posts: 61
Threads: 43
Joined: Jul 2020
(Oct-03-2020, 04:06 AM)buran Wrote: Can you provide a sample dataframe? There is no need for 60,000 rows; 2-3 rows are sufficient.
Ideally, provide code that creates a DataFrame with the same structure as yours, like I did in my example, which works.
Hi buran
My apologies for not getting back to you sooner.
I have tried several ways to get your method of looping through the data to work, however I still cannot and I don't understand why.
I have re-created the data set along with the code you suggested, which is:
import pandas
raw_data = pandas.DataFrame(columns=["Column_0", "Column_1", "Column_2", "Column_3", "Column_4", "Column_5", "Column_6", "Column_7", "Column_8"])
raw_data.loc[0,:] = ("315589", "CHZ3", "1100", "218", "694.63", "nan", "-1.589", "0", "1.3694")
raw_data.loc[1,:] = ("364048", "CHZ3", "1100", "320", "12.09", "nan", "-7.216", "0", "59.89")
raw_data[5] = raw_data[3] * 8950 + raw_data[4]
#raw_data.iloc[:,5] = raw_data.iloc[:,3] * 8950 + raw_data.iloc[:,4]
If you run the code as it is you will see that I get an error if I follow your method, however if I comment out the code you suggested and uncomment the last line it works fine.
But I want to understand why your suggestion does not work?
I think I am going crazy?
Thank you.
Posts: 8,160
Threads: 160
Joined: Sep 2016
Your dataframe has column names, so
raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
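The difference between the two kinds of lookup can be sketched as follows. Square brackets on a DataFrame look up a column label, while iloc looks up a position and ignores the labels. The small frames and their values below are made up for illustration.

```python
import pandas as pd

# Frame with named columns (hypothetical numeric values).
named = pd.DataFrame({"Column_3": [218, 320], "Column_4": [694.63, 12.09]})

# [] looks up a column *label*; no column is labelled 0 here, so this fails.
try:
    named[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

by_label = named["Column_3"]      # label-based lookup
by_position = named.iloc[:, 0]    # position-based lookup, ignores labels

# Without explicit names the labels default to 0, 1, ... so [1] happens
# to match the second position as well.
unnamed = pd.DataFrame([[1, 2], [3, 4]])
second_column = unnamed[1]

print(lookup_failed)
print(by_label.equals(by_position))
print(second_column.tolist())
```

This is why df[3] worked in the dummy example (its labels are the integers 0, 1, 2) but raised KeyError: 3 on a frame whose columns are named Column_0 to Column_8.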
Posts: 61
Threads: 43
Joined: Jul 2020
Hi Buran
Thank you for your response.
I think I understand: if no column names are specified in the dataframe, then the values in the square brackets are the column index numbers.
If column names are specified, then the name of the column needs to be included inside the square brackets. Would this be correct?
I personally prefer to use row and column index numbers through iloc:
df_raw_data.iloc[:,5] = df_raw_data.iloc[:,3] * 1760 + df_raw_data.iloc[:,4]
But I guess both achieve the same thing.
The change you suggested, which was to add the column names, runs without an error:
raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']
However, the arithmetic does not work. For example, the value in Column_5 is 218218218218..., so the code just repeats the value from Column_3 8950 times.
I am not certain why this is?
Posts: 8,160
Threads: 160
Joined: Sep 2016
In your sample dataframe all values are strings, e.g. "218" rather than 218. Both versions of the code yield the same result.
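A minimal sketch of one possible fix, assuming the columns hold numeric text: convert them with pd.to_numeric before doing the arithmetic. The two-column frame below reuses values from the sample, and dtype=object is an assumption made to mirror a frame that was filled row by row with strings.

```python
import pandas as pd

# Two columns from the sample, stored as strings (object dtype).
raw = pd.DataFrame(
    {"Column_3": ["218", "320"], "Column_4": ["694.63", "12.09"]},
    dtype=object,
)

# On strings, * repeats the text and + concatenates it.
as_text = raw["Column_3"] * 3 + raw["Column_4"]

# Convert to numbers first, then the arithmetic behaves as intended.
col3 = pd.to_numeric(raw["Column_3"])
col4 = pd.to_numeric(raw["Column_4"])
as_number = col3 * 8950 + col4

print(as_text.tolist())
print(as_number.tolist())
```

If the data comes from a CSV through pd.read_csv, numeric columns are normally parsed as numbers already, so this conversion may only be needed for frames built by hand from strings.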
Posts: 61
Threads: 43
Joined: Jul 2020
(Oct-17-2020, 02:44 PM)buran Wrote: In your sample dataframe all values are strings, e.g. "218" rather than 218. Both versions of the code yield the same result.
You're right.
Thanks Buran, I really appreciate your help.
Posts: 1,358
Threads: 2
Joined: May 2019
I'd be interested as to whether there is a difference in performance. Do you gain or lose speed by specifying df.iloc[:, col] vs df[col]? With a 60K row dataframe the difference might be noticeable if there is one.
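One way to find out is to time both forms with timeit on a frame of similar size. This is only a sketch: the random 60,000-row frame and the function names by_label and by_position are made up for the measurement, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 60,000-row frame with the default integer labels 0..5.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((60_000, 6)))


def by_label():
    # Column lookup by label.
    return df[3] * 8950 + df[4]


def by_position():
    # Column lookup by position.
    return df.iloc[:, 3] * 8950 + df.iloc[:, 4]


t_label = timeit.timeit(by_label, number=200)
t_position = timeit.timeit(by_position, number=200)
print(f"df[col]: {t_label:.4f}s  df.iloc[:, col]: {t_position:.4f}s")
```

Both forms hand whole columns to the same arithmetic, so I would guess the gap is small next to the cost of the original row-by-row loop, but that is a guess until it is measured.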