My python code is running very slow on millions of records

shantanu97 · (This post was last modified: Dec-28-2021, 01:03 AM by shantanu97.)

I want to process data in Python that has 2 million rows and more than 100 columns. My code takes 20 minutes to create an output file. I don't know whether there is something else that would make my code faster, or whether I can change something in it. Any help would be greatly appreciated!

df2 = pd.DataFrame()
for fn in csv_files:  # Looping over CSV files
    all_dfs = pd.read_csv(fn, header=None)

    # Finding non-null columns
    non_null_columns = [col for col in all_dfs.columns if all_dfs.loc[:, col].notna().any()]

    # print(non_null_columns)
    for i in range(0, len(all_dfs)):  # Row loop
        SourceFile = ""
        RowNumber = ""
        ColumnNumber = ""
        Value = ""
        for j in range(0, len(non_null_columns)):  # Column loop
            SourceFile = Path(fn.name)
            RowNumber = i+1
            ColumnNumber = j+1
            Value = all_dfs.iloc[i, j]
            df2 = df2.append(pd.DataFrame({
                "SourceFile": [SourceFile],
                "RowNumber": [RowNumber],
                "ColumnNumber": [ColumnNumber],
                "Value": [Value]
            }), ignore_index=True)
            # print(df2)
df2['Value'].replace('', np.nan, inplace=True)  # Removing null values
df2.dropna(subset=['Value'], inplace=True)
df2.to_csv(os.path.join(path_save, f"Compiled.csv"), index=False)
print("Output: Compiled.csv")

Python code attached.

paul18fr · (This post was last modified: Dec-27-2021, 12:23 PM by paul18fr.)

What type of data are you dealing with in the original CSV file: pure numbers, strings, or both?

Appending is costly, and maybe loops can be avoided using vectorisation if data are numbers.
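
To illustrate the point about appending, here is a minimal sketch that collects one record per non-empty cell in a plain Python list and builds the DataFrame only once at the end, rather than appending to a DataFrame inside the loop. The function name `cells_to_long` and the small sample frame are hypothetical, and the sketch assumes the goal is the same SourceFile / RowNumber / ColumnNumber / Value layout as in the posted code.

```python
import pandas as pd


def cells_to_long(df, source_name):
    """Collect one record per non-empty cell, then build the DataFrame once."""
    records = []
    for row_number, row in enumerate(df.itertuples(index=False), start=1):
        for column_number, value in enumerate(row, start=1):
            # Skip missing values and empty strings
            if pd.notna(value) and value != "":
                records.append((source_name, row_number, column_number, value))
    return pd.DataFrame(
        records, columns=["SourceFile", "RowNumber", "ColumnNumber", "Value"]
    )


# Hypothetical sample data standing in for one CSV file
sample = pd.DataFrame([[1, None, "a"], [None, 2.5, "b"]])
long_df = cells_to_long(sample, "sample.csv")
print(long_df)
```

Appending to a list is cheap, whereas each DataFrame append copies all the data gathered so far, so the total cost of the original loop grows roughly with the square of the number of cells.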

**Larz60+** · Dec-27-2021, 11:18 PM

I expect that you are paging memory.
How much memory do you have?
What paul18fr states about appending is true and should be avoided.
Do you need to have everything resident at the same time?
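
If not everything has to be resident at once, one option is to stream the files row by row. The following is only a sketch using the standard `csv` module, assuming the output is one line per non-empty cell; the function name `stream_cells` and the temporary test file are made up for illustration.

```python
import csv
import os
import tempfile


def stream_cells(csv_paths, out_path):
    """Write one output row per non-empty cell, keeping only one input row in memory."""
    written = 0
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["SourceFile", "RowNumber", "ColumnNumber", "Value"])
        for csv_path in csv_paths:
            source_name = os.path.basename(csv_path)
            with open(csv_path, newline="") as in_file:
                for row_number, row in enumerate(csv.reader(in_file), start=1):
                    for column_number, value in enumerate(row, start=1):
                        if value != "":
                            writer.writerow([source_name, row_number, column_number, value])
                            written += 1
    return written


# Small self-contained demonstration with a throwaway input file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.csv")
    with open(src, "w", newline="") as f:
        f.write("a,,1\n,b,\n")
    out = os.path.join(tmp, "Compiled.csv")
    count = stream_cells([src], out)
    with open(out, newline="") as f:
        result_rows = list(csv.reader(f))

print(count, result_rows)
```

Memory use here stays roughly constant regardless of file size, at the price of treating every value as a string.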

shantanu97 · Dec-28-2021, 12:50 AM

(Dec-27-2021, 12:22 PM)paul18fr Wrote: What type of data are you dealing with in the original CSV file: pure numbers, strings, or both?

Appending is costly, and maybe loops can be avoided using vectorisation if data are numbers.

It consists of strings, numbers, and dates.

shantanu97 · Dec-28-2021, 12:54 AM

(Dec-27-2021, 11:18 PM)Larz60+ Wrote: I expect that you are paging memory.
How much memory do you have?
What paul18fr states about appending is true and should be avoided.
Do you need to have everything resident at the same time?

I use a fairly powerful PC: 24 GB of RAM, a 250 GB hard disk, and an i7 processor. Can you tell me what I should use instead, if appending is costly? Is there any way to make the loop faster?

**Larz60+** · Dec-28-2021, 02:23 AM

untested, but close:

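import glob
import os

import pandas as pd

path = "path/to/csv/folder"  # placeholder: set this to your CSV file path
filelist = glob.glob(os.path.join(path, "*.csv"))

# Read every file and concatenate once, instead of appending row by row
df = pd.concat(pd.read_csv(f) for f in filelist)
df = df.fillna('')  # replace NaN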
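
The snippet above only combines the files; it does not yet produce the one-row-per-cell layout of the original post. As a rough sketch of how that reshaping might be done without Python-level loops, the following uses a NumPy mask over each frame. The function name `frame_to_long` and the in-memory sample CSV are assumptions for illustration, and `header=None` is assumed because the original post reads its files that way.

```python
import io

import numpy as np
import pandas as pd


def frame_to_long(df, source_name):
    """Return one row per non-null cell, using array operations instead of loops."""
    values = df.to_numpy(dtype=object)
    mask = pd.notna(values)          # True where a cell holds a value
    rows, cols = np.nonzero(mask)    # 0-based positions in row-major order
    return pd.DataFrame({
        "SourceFile": source_name,
        "RowNumber": rows + 1,       # 1-based, as in the original post
        "ColumnNumber": cols + 1,
        "Value": values[mask],
    })


# Hypothetical CSV content standing in for one file from filelist
csv_text = "a,,1\n,b,\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)
long_df = frame_to_long(df, "test.csv")
print(long_df)
```

With several files, the per-file results could be gathered in a list and passed to a single `pd.concat(..., ignore_index=True)` call before writing the output.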

shantanu97 · Dec-28-2021, 02:34 AM

(Dec-28-2021, 02:23 AM)Larz60+ Wrote: untested, but close:

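import glob
import os

import pandas as pd

path = "path/to/csv/folder"  # placeholder: set this to your CSV file path
filelist = glob.glob(os.path.join(path, "*.csv"))

# Read every file and concatenate once, instead of appending row by row
df = pd.concat(pd.read_csv(f) for f in filelist)
df = df.fillna('')  # replace NaN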

I have attached a test.csv file for testing.

**Larz60+** · Dec-28-2021, 11:02 AM

Please run tests and report results.
