Python Forum
Why doesn't drop_duplicates() modify my Dataframe?
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Why doesn't drop_duplicates() modify my Dataframe?
#1
I have two files with exactly same data in it. My python reads that input file and fetch out all the necessary information and then save it in a CSV file. In short, python do data extraction from input file. The 2 rows look identical except datasource field. Marked with 2 different colours. See the attached image-1.

df2.drop_duplicates(subset=['PlodDate','PlodShift', 'ContractNo', 'RigNo', 'Department','DayType','Location','Comments','MachineHoursFrom','MachineHoursTo'], keep='last',inplace=True)
The above line has no effect. I also not included data source field in it. It is not removing any duplicates rows in the files. But, when printed to the new csv file, duplicates still remain within the day.

from pathlib import Path
import time
import parser
import argparse
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

parser = argparse.ArgumentParser(description="Process some integers.")

parser.add_argument("path", help="define the directory to folder/file")
parser.add_argument("--verbose", help="display processing information")

start = time.time()


def main(path_xlsm, verbose):
    if (".xlsm" in str(path_xlsm).lower()) and path_xlsm.is_file():
        xlsm_files = [Path(path_xlsm)]
    else:
        xlsm_files = list(Path(path_xlsm).glob("*.xlsm"))

    df = pd.DataFrame()

    for fn in xlsm_files:
        all_dfs = pd.read_excel(fn, sheet_name=None, header=None, engine="openpyxl")
        list_data = all_dfs.keys()
        all_dfs.pop("Lookups", None)
        all_dfs.pop("Instructions For Use", None)

        df2 = pd.DataFrame()

        for ws in list_data:  # Looping for excel sheet
            df1 = all_dfs[ws]
            new_row = {
                "PlodDate": df1.iloc[2, 8],
                "PlodShift": df1.iloc[3, 8],
                "ContractNo": df1.iloc[2, 3],
                "RigNo": df1.iloc[2, 6],
                "Department": df1.iloc[3, 6],
                "DayType": df1.iloc[0, 13],
                "Location": df1.iloc[4, 3],
                "Comments": df1.iloc[119, 0],
                "MachineHoursFrom": df1.iloc[2, 13],
                "MachineHoursTo": df1.iloc[3, 13],
                "DataSource": Path(fn.name),
            }
            df2 = df2.append(new_row, ignore_index=True)
            df2 = df2[
                [
                    "PlodDate",
                    "PlodShift",
                    "ContractNo",
                    "RigNo",
                    "Department",
                    "DayType",
                    "Location",
                    "Comments",
                    "MachineHoursFrom",
                    "MachineHoursTo",
                    "DataSource",
                ]
            ]
            df2.drop_duplicates(
                subset=[
                    "PlodDate",
                    "PlodShift",
                    "ContractNo",
                    "RigNo",
                    "Department",
                    "DayType",
                    "Location",
                    "Comments",
                    "MachineHoursFrom",
                    "MachineHoursTo",
                ],
                keep="last",
                inplace=True,
            )
            df2 = df2.replace(",", ";", regex=True)
            df2 = df2.replace("\n", " ", regex=True)
            df2 = df2.replace("\r", " ", regex=True)
            df2 = df2.replace("\t", " ", regex=True)
        df = df.append(df2)
    df.to_csv("RDPlod.csv", index=False)


if __name__ == "__main__":
    start = time.time()
    args = parser.parse_args()
    path = Path(args.path)
    verbose = args.verbose
    main(path, verbose)  # Calling Main Function
    print("Processed time:", time.time() - start)  # Total Time
Why my drop_duplicates function is not working.

Attach all files here: https://drive.google.com/drive/folders/1...sp=sharing

Attached Files

Thumbnail(s)
   

.csv   RDPlod.csv (Size: 1.56 KB / Downloads: 230)
.py   rdplod.py (Size: 3.06 KB / Downloads: 228)
Reply
#2
Read the documentation. This function either creates a new dataframe with the duplicates removed, or it modifies the existing dataframe depending on the valus of the inplace argument.

https://pandas.pydata.org/pandas-docs/st...cates.html
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020