Python Forum

My Python code is running very slow on millions of records
I want to process data with a Python script; the data has 2 million rows and more than 100 columns. My code takes 20 minutes to create the output file. I don't know whether there is a different approach I should use, or whether I can change something in my code to make it faster. Any help would be greatly appreciated!

import os
from pathlib import Path

import numpy as np
import pandas as pd

# csv_files and path_save are defined earlier in the script
df2 = pd.DataFrame()
for fn in csv_files:  # loop over the CSV files
    all_dfs = pd.read_csv(fn, header=None)

    # find the columns that contain at least one non-null value
    non_null_columns = [col for col in all_dfs.columns if all_dfs.loc[:, col].notna().any()]

    for i in range(len(all_dfs)):  # row loop
        for j in range(len(non_null_columns)):  # column loop
            SourceFile = Path(fn).name
            RowNumber = i + 1
            ColumnNumber = j + 1
            Value = all_dfs.loc[i, non_null_columns[j]]  # value of the j-th non-null column
            # one single-row append per cell -- this is the expensive part
            df2 = df2.append(pd.DataFrame({
                "SourceFile": [SourceFile],
                "RowNumber": [RowNumber],
                "ColumnNumber": [ColumnNumber],
                "Value": [Value]
            }), ignore_index=True)

df2['Value'].replace('', np.nan, inplace=True)  # treat empty strings as null
df2.dropna(subset=['Value'], inplace=True)
df2.to_csv(os.path.join(path_save, "Compiled.csv"), index=False)
print("Output: Compiled.csv")
Attach your Python code.
What type of data are you dealing with in the original CSV file? Pure numbers? Strings? Both?

Appending is costly, and maybe the loops can be avoided using vectorisation if the data are numbers.
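For example, a rough sketch of a vectorised version (untested; it reuses the csv_files list from the original script, and DataFrame.stack drops the NaN cells by itself, so both Python loops and the per-cell append disappear):

import pandas as pd
from pathlib import Path

frames = []
for fn in csv_files:  # same file list as in the original script
    df = pd.read_csv(fn, header=None)
    # stack() yields one row per (row, column) cell and skips NaN cells
    long = df.stack().reset_index()
    long.columns = ["RowNumber", "ColumnNumber", "Value"]
    long["RowNumber"] += 1      # 1-based, as in the original output
    long["ColumnNumber"] += 1
    long.insert(0, "SourceFile", Path(fn).name)
    frames.append(long)         # plain list append, which is cheap

compiled = pd.concat(frames, ignore_index=True)
compiled.to_csv("Compiled.csv", index=False)

Empty strings, if any, can still be filtered out afterwards, as in the original script.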
I expect that you are paging memory.
How much memory do you have?
What paul18fr says about appending is true: it should be avoided.
Do you need to have everything resident at the same time?
(Dec-27-2021, 12:22 PM)paul18fr Wrote: What type of data are you dealing with in the original CSV file? Pure numbers? Strings? Both?

Appending is costly, and maybe the loops can be avoided using vectorisation if the data are numbers.

The data is a mix of strings, numbers, and dates.
(Dec-27-2021, 11:18 PM)Larz60+ Wrote: I expect that you are paging memory.
How much memory do you have?
What paul18fr says about appending is true: it should be avoided.
Do you need to have everything resident at the same time?

I'm using a very powerful PC: 24 GB of RAM, a 250 GB hard disk, and an i7 processor. Can you tell me what I should use instead if appending is costly? Is there any way to make the loop faster?
untested, but close:
import glob
import os

import pandas as pd

path = "your_csv_path"  # placeholder: directory holding the CSV files
filelist = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in filelist))
df = df.fillna('')  # replace NaN with empty strings
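If memory does turn out to be the limit, read_csv also takes a chunksize argument, so each file can be streamed in pieces rather than loaded whole. A minimal sketch (the directory name, chunk size and output path are placeholders):

import glob
import os

import pandas as pd

path = "your_csv_path"  # placeholder
for f in glob.glob(os.path.join(path, "*.csv")):
    # read 100,000 rows at a time so a whole file never sits in memory at once
    for chunk in pd.read_csv(f, header=None, chunksize=100_000):
        chunk.to_csv("Compiled.csv", mode="a", header=False, index=False)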
(Dec-28-2021, 02:23 AM)Larz60+ Wrote: untested, but close:

import glob
import os

import pandas as pd

path = "your_csv_path"  # placeholder: directory holding the CSV files
filelist = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in filelist))
df = df.fillna('')  # replace NaN with empty strings

I have attached a test.csv file for testing.
Please run tests and report results.