Looping Through Large Data Sets - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Looping Through Large Data Sets (/thread-30063.html)
Looping Through Large Data Sets - JoeDainton123 - Oct-02-2020

Hello all

I have a CSV file with over 60,000 rows of data which I have imported into a DataFrame, and I want to perform various calculations on it. I am looping through the data using a while loop and a counter. At each iteration of the counter I perform some arithmetic; the code I have is as follows:

q = 0
while q < len(df_raw_data):
    df_raw_data.iloc[q, 5] = (df_raw_data.iloc[q, 3] * 8950) + (df_raw_data.iloc[q, 4])
    q = q + 1

This code works and I fully understand it, however it takes ages, and I wanted to know if there was a better way of looping through large data sets.

Thank you

RE: Looping Through Large Data Sets - buran - Oct-02-2020

You don't work like this in pandas. Your loop can be replaced with a single vectorised expression:

df_raw_data[5] = df_raw_data[3] * 8950 + df_raw_data[4]

Sample with a dummy df:

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(df)
df[3] = df[1] * 5 + df[2]
print(df)

RE: Looping Through Large Data Sets - JoeDainton123 - Oct-02-2020

Hi buran

Thank you for your reply. I really like your way, however when I attempt it I get an error.
The code I am using is as follows:

df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]

The error I get is:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 3

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<ipython-input-5-d11bee5b17a9>", line 1, in <module>
    df_raw_data[5] = df_raw_data[3] * 1760 + df_raw_data[4]
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err
KeyError: 3

I can make it work using the following code:

df_raw_data.iloc[:, 5] = df_raw_data.iloc[:, 3] * 1760 + df_raw_data.iloc[:, 4]

But I am not sure why I keep getting the error.

RE: Looping Through Large Data Sets - buran - Oct-03-2020

Can you provide a sample DataFrame? (No need for 60,000 rows; 2-3 rows is sufficient.) Ideally, provide code that creates a DataFrame with the same structure as yours - like I did in my example, which works.

RE: Looping Through Large Data Sets - JoeDainton123 - Oct-16-2020

(Oct-03-2020, 04:06 AM)buran Wrote: can you provide sample dataframe (no need of 60 000 rows, just 2-3 rows is sufficient)

Hi buran

My apologies for not getting back to you sooner.
I have tried several ways to get your method of working through the data to work, however I still cannot, and I don't understand why. I have re-created the data set along with the code you suggested, which is:

import pandas

raw_data = pandas.DataFrame(columns=["Column_0", "Column_1", "Column_2", "Column_3", "Column_4",
                                     "Column_5", "Column_6", "Column_7", "Column_8"])
raw_data.loc[0, :] = ("315589", "CHZ3", "1100", "218", "694.63", "nan", "-1.589", "0", "1.3694")
raw_data.loc[1, :] = ("364048", "CHZ3", "1100", "320", "12.09", "nan", "-7.216", "0", "59.89")

raw_data[5] = raw_data[3] * 8950 + raw_data[4]
#raw_data.iloc[:, 5] = raw_data.iloc[:, 3] * 8950 + raw_data.iloc[:, 4]

If you run the code as it is, you will see that I get an error if I follow your method; however, if I comment out the line you suggested and uncomment the last line, it works fine. But I want to understand why your suggestion does not work. I think I am going crazy!

Thank you.

RE: Looping Through Large Data Sets - buran - Oct-17-2020

Your DataFrame has column names, so:

raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']

RE: Looping Through Large Data Sets - JoeDainton123 - Oct-17-2020

Hi buran

Thank you for your response. I think I understand: if no column names are specified in the DataFrame, then the values in the square brackets are the column index numbers, whereas if column names are specified, then the name of the column needs to go inside the square brackets - would this be correct?

I personally prefer to use row and column index numbers through the iloc command:

df_raw_data.iloc[:, 5] = df_raw_data.iloc[:, 3] * 1760 + df_raw_data.iloc[:, 4]

But I guess both achieve the same thing.
The change you suggested, which was to add the column names, does run:

raw_data['Column_5'] = raw_data['Column_3'] * 8950 + raw_data['Column_4']

However, the arithmetic is wrong: for example, the value in Column_5 comes out as 218218218..., as if the code just copies the value from Column_3 8950 times. I am not certain why this is?

RE: Looping Through Large Data Sets - buran - Oct-17-2020

In your sample DataFrame all values are strings, e.g. "218" vs 218, and multiplying a string by an integer repeats it rather than doing arithmetic. Both pieces of code yield the same result.
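[Editor's note] The string-repetition pitfall buran describes can be reproduced with a minimal sketch. The two-row frame below mirrors the thread's sample data; pd.to_numeric is one common way to convert the offending columns (reading the CSV with explicit dtypes would work as well):

```python
import pandas as pd

# Two rows of the thread's sample data, stored as strings as in Joe's example.
df = pd.DataFrame({"Column_3": ["218", "320"], "Column_4": ["694.63", "12.09"]})

# On string columns, "*" repeats the text instead of multiplying.
print(df["Column_3"] * 3)  # first value is "218218218", not 654

# Converting to numeric first gives the intended arithmetic.
df["Column_3"] = pd.to_numeric(df["Column_3"])
df["Column_4"] = pd.to_numeric(df["Column_4"])
df["Column_5"] = df["Column_3"] * 8950 + df["Column_4"]
print(df["Column_5"])  # 218 * 8950 + 694.63 = 1951794.63, etc.
```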
RE: Looping Through Large Data Sets - JoeDainton123 - Oct-17-2020

(Oct-17-2020, 02:44 PM)buran Wrote: In your sample dataframe all values are strings, e.g. "218" vs 218.

You're right. Thank you, buran, I really appreciate your help.

RE: Looping Through Large Data Sets - jefsummers - Oct-18-2020

I'd be interested as to whether there is a difference in performance. Do you gain or lose speed by specifying df.iloc[:, col] vs df[col]? With a 60K-row DataFrame the difference might be noticeable, if there is one.
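[Editor's note] One rough way to answer the performance question is to time both vectorised forms on a dummy frame of the same size. This is a sketch with made-up random data; the absolute numbers will vary by machine, and both forms should be far faster than the original per-row while loop since each is a single vectorised operation:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 60,000-row frame of random numbers, mimicking the thread's shape.
df = pd.DataFrame(np.random.rand(60_000, 6),
                  columns=[f"Column_{i}" for i in range(6)])

def by_position():
    # Positional access, as in Joe's preferred style.
    df.iloc[:, 5] = df.iloc[:, 3] * 8950 + df.iloc[:, 4]

def by_label():
    # Label access, as buran suggested.
    df["Column_5"] = df["Column_3"] * 8950 + df["Column_4"]

for fn in (by_position, by_label):
    t = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {t / 100 * 1000:.3f} ms per call")
```

Both versions compute identical values, so any measured gap reflects indexing overhead only.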