Python Forum
My python code is running very slow on millions of records
#1
I want to process data with a Python script. The input has 2 million rows and more than 100 columns, and my code takes 20 minutes to create the output file. I don't know whether there is something I can change to make it faster. Any help would be greatly appreciated!

import os
from pathlib import Path

import numpy as np
import pandas as pd

# csv_files and path_save are defined elsewhere in the attached script
df2 = pd.DataFrame()
for fn in csv_files:  # Looping Over CSV Files
    all_dfs = pd.read_csv(fn, header=None)

    # Finding non-null columns
    non_null_columns = [col for col in all_dfs.columns if all_dfs.loc[:, col].notna().any()]

    # print(non_null_columns)
    for i in range(0, len(all_dfs)):  # Row Loop
        SourceFile = ""
        RowNumber = ""
        ColumnNumber = ""
        Value = ""
        for j in range(0, len(non_null_columns)):  # Column Loop
            SourceFile = Path(fn.name)
            RowNumber = i+1
            ColumnNumber = j+1
            Value = all_dfs.iloc[i, j]
            df2 = df2.append(pd.DataFrame({
                "SourceFile": [SourceFile],
                "RowNumber": [RowNumber],
                "ColumnNumber": [ColumnNumber],
                "Value": [Value]
            }), ignore_index=True)
            # print(df2)
df2['Value'].replace('', np.nan, inplace=True)  # Removing Null Value
df2.dropna(subset=['Value'], inplace=True)
df2.to_csv(os.path.join(path_save, "Compiled.csv"), index=False)
print("Output: Compiled.csv")
The Python code is attached.

Attached Files

.py   NormalizedCSV.py (Size: 2.2 KB / Downloads: 196)
.csv   Test.csv (Size: 708 bytes / Downloads: 216)
#2
What type of data are you dealing with in the original csv file? Pure numbers? Strings? Both?

Appending is costly, and maybe loops can be avoided using vectorisation if data are numbers.
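
To illustrate why appending is costly: each `df2.append(...)` call copies the whole frame built so far, so the total work grows roughly quadratically with the number of rows. A minimal sketch of the usual alternative is to collect plain dicts in a list and build the DataFrame once at the end. The function name `build_rows` and its arguments are made up for this example and are not from the attached script.

```python
import pandas as pd


def build_rows(values, source_name):
    """Collect one dict per cell, then create the DataFrame in a single call.

    `values` is any iterable of rows (hypothetical input for this sketch).
    """
    rows = []
    for i, row in enumerate(values, start=1):
        for j, value in enumerate(row, start=1):
            # Appending to a Python list is cheap; nothing is copied here.
            rows.append({
                "SourceFile": source_name,
                "RowNumber": i,
                "ColumnNumber": j,
                "Value": value,
            })
    # One DataFrame construction instead of one per cell.
    return pd.DataFrame(
        rows, columns=["SourceFile", "RowNumber", "ColumnNumber", "Value"]
    )


if __name__ == "__main__":
    print(build_rows([[1, "a"], [2, "b"]], "Test.csv"))
```

This still loops in Python, so it is only a first step, but it removes the repeated copying.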
#3
I expect that you are paging memory.
How much memory do you have?
What paul18fr states about appending is true and should be avoided.
Do you need to have everything resident at the same time?
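
If everything does not need to be resident at once, one option is to read each file in chunks and write the output incrementally. The sketch below assumes header-less CSV input (as in the original `read_csv(fn, header=None)` call); the function name `process_in_chunks` and the chunk size are arbitrary choices for illustration.

```python
import pandas as pd


def process_in_chunks(src, dst, chunksize=100_000):
    """Stream `src` to `dst` one chunk at a time and return the row count."""
    total = 0
    first = True
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(src, header=None, chunksize=chunksize):
        # Per-chunk processing would go here.
        chunk.to_csv(
            dst,
            mode="w" if first else "a",  # overwrite once, then append
            header=False,
            index=False,
        )
        first = False
        total += len(chunk)
    return total
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the file size.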
#4
(Dec-27-2021, 12:22 PM)paul18fr Wrote: What type of data are you dealing with in the original csv file? Pure numbers? Strings? Both?

Appending is costly, and maybe loops can be avoided using vectorisation if data are numbers.

It contains strings, numbers, and dates.
#5
(Dec-27-2021, 11:18 PM)Larz60+ Wrote: I expect that you are paging memory.
How much memory do you have?
What paul18fr states about appending is true and should be avoided.
Do you need to have everything resident at the same time?

My PC is fairly powerful: 24 GB of RAM, a 250 GB hard disk, and an i7 processor. If appending is costly, what should I use instead? Is there any way to make the loop faster?
#6
untested, but close:
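import glob
import os

import pandas as pd

path = "path/to/csv/files"  # placeholder: replace with your CSV directory
filelist = glob.glob(os.path.join(path, "*.csv"))

# Read every file and concatenate once instead of appending row by row
df = pd.concat(pd.read_csv(f) for f in filelist)
df = df.fillna('')  # replace NaN with empty strings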
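
The original question wants one output row per non-empty cell (SourceFile, RowNumber, ColumnNumber, Value). A possible vectorised sketch of that reshape uses `DataFrame.melt` instead of nested loops. It assumes columns are numbered after the all-null columns have been dropped, which may not match the attached script exactly; the helper name `to_long` is invented for this example.

```python
import pandas as pd


def to_long(df, source_name):
    """Reshape a wide frame into one row per non-empty cell."""
    # Drop columns that contain no data at all.
    df = df.dropna(axis=1, how="all").copy()
    # Number the surviving columns and the rows starting from 1.
    df.columns = range(1, df.shape[1] + 1)
    df.insert(0, "RowNumber", list(range(1, len(df) + 1)))
    # Wide to long: one row per (RowNumber, ColumnNumber) pair.
    long = df.melt(id_vars="RowNumber", var_name="ColumnNumber", value_name="Value")
    # Keep only cells that are neither NaN nor empty strings.
    keep = long["Value"].notna() & (long["Value"] != "")
    long = long[keep].copy()
    long["ColumnNumber"] = long["ColumnNumber"].astype(int)
    long.insert(0, "SourceFile", source_name)
    return long.sort_values(["RowNumber", "ColumnNumber"], ignore_index=True)
```

Calling `to_long(pd.read_csv(f, header=None), os.path.basename(f))` for each file and passing the results to a single `pd.concat` would then avoid both the per-cell loop and the repeated appends.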
#7
(Dec-28-2021, 02:23 AM)Larz60+ Wrote: untested, but close:
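import glob
import os

import pandas as pd

path = "path/to/csv/files"  # placeholder: replace with your CSV directory
filelist = glob.glob(os.path.join(path, "*.csv"))

# Read every file and concatenate once instead of appending row by row
df = pd.concat(pd.read_csv(f) for f in filelist)
df = df.fillna('')  # replace NaN with empty strings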

I have attached the Test.csv file for testing.
#8
Please run tests and report results.