Dec-27-2021, 11:02 AM
(This post was last modified: Dec-28-2021, 01:03 AM by shantanu97.)
I want to process CSV data with a Python script; the data has 2 million rows and more than 100 columns. My code takes 20 minutes to create the output file. I don't know whether I should use a different approach or whether something in my current code can be changed to make it faster. Any help would be greatly appreciated!
import os
from pathlib import Path

import numpy as np
import pandas as pd

# csv_files (list of Path objects pointing at the CSV files) and path_save
# (output folder) are defined earlier in my script

df2 = pd.DataFrame()
for fn in csv_files:  # loop over CSV files
    all_dfs = pd.read_csv(fn, header=None)
    # keep only columns that contain at least one non-null value
    non_null_columns = [col for col in all_dfs.columns if all_dfs.loc[:, col].notna().any()]
    # print(non_null_columns)
    for i in range(0, len(all_dfs)):  # row loop
        for j in range(0, len(non_null_columns)):  # column loop
            SourceFile = Path(fn.name)
            RowNumber = i + 1
            ColumnNumber = j + 1
            Value = all_dfs.loc[i, non_null_columns[j]]  # value of the j-th non-null column
            df2 = df2.append(pd.DataFrame({
                "SourceFile": [SourceFile],
                "RowNumber": [RowNumber],
                "ColumnNumber": [ColumnNumber],
                "Value": [Value]
            }), ignore_index=True)
            # print(df2)

df2['Value'].replace('', np.nan, inplace=True)  # treat empty strings as missing
df2.dropna(subset=['Value'], inplace=True)      # drop rows without a value
df2.to_csv(os.path.join(path_save, "Compiled.csv"), index=False)
print("Output: Compiled.csv")
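One direction I have been looking at, in case it helps the discussion: replace the nested row/column loops and the repeated df.append calls (append copies the whole frame every time it is called) with one stack per file and a single pd.concat at the end. This is only a rough, untested sketch using the same csv_files and path_save variables as above, and ColumnNumber here counts the original column positions rather than only the non-empty columns, so the output may not match my current code exactly:

import os
from pathlib import Path

import numpy as np
import pandas as pd

frames = []
for fn in csv_files:
    df = pd.read_csv(fn, header=None)
    long_df = df.stack().reset_index()  # one (row, column, value) record per non-null cell
    long_df.columns = ["RowNumber", "ColumnNumber", "Value"]
    long_df["RowNumber"] += 1           # 1-based row numbers
    long_df["ColumnNumber"] += 1        # 1-based column numbers
    long_df["SourceFile"] = Path(fn.name)
    frames.append(long_df[["SourceFile", "RowNumber", "ColumnNumber", "Value"]])

df2 = pd.concat(frames, ignore_index=True)
df2["Value"] = df2["Value"].replace("", np.nan)  # treat empty strings as missing
df2.dropna(subset=["Value"], inplace=True)
df2.to_csv(os.path.join(path_save, "Compiled.csv"), index=False)
print("Output: Compiled.csv")

Is building a list of per-file frames and concatenating once the right way to avoid the per-row append cost, or is there a better approach?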