Python Forum
remove partial duplicates from csv
#1
Hello,

First, I should stress that I do not have much experience with Python or with programming in general.

The situation: I have a CSV file of annotated data to be used for aspect-based sentiment analysis (ABSA) and Automatic Aspect Term Extraction. The first column ("col1") contains the name of the annotated file, the second column ("col2") contains the tagged aspect, and the third column ("col3") contains the category tag given to the aspect. The columns are separated by tabs (because the aspect sometimes contains a comma).

Quite often, the same target words are tagged as more than one aspect, or the tagged aspects at least partly overlap.

Example:

sentence = "It's time for the annual London Book Fair, lovely weather btw"

col1|col2|col3
document_1|London Book Fair|EVENT
document_1|London|LOCATION
document_1|weather|WEATHER


To train the system for Automatic Aspect Term Extraction on the annotated data, there can be no overlap. So only one of the first two rows should remain; it does not matter which one.

Goal: I wish to remove those rows where the value for the first column is an exact duplicate and where the value of the second column is at the same time a partial duplicate (third column can be ignored).
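
To make the goal concrete, here is a minimal pure-Python sketch. It assumes that "partial duplicate" means one aspect is a substring of the other within the same document, and that the first row seen is the one to keep (the post says it does not matter which). The function name `drop_overlapping` is only illustrative.

```python
def drop_overlapping(rows):
    """Keep only rows whose aspect does not overlap an already kept aspect.

    rows: iterable of (document, aspect, category) tuples.
    Assumption: two aspects overlap when one is a substring of the other.
    The category (third column) is ignored for the comparison.
    """
    kept = []
    for document, aspect, category in rows:
        overlaps = any(
            document == kept_doc and (aspect in kept_aspect or kept_aspect in aspect)
            for kept_doc, kept_aspect, _ in kept
        )
        if not overlaps:
            kept.append((document, aspect, category))
    return kept


# The three example rows from the table above.
rows = [
    ("document_1", "London Book Fair", "EVENT"),
    ("document_1", "London", "LOCATION"),
    ("document_1", "weather", "WEATHER"),
]
result = drop_overlapping(rows)
print(result)  # keeps the EVENT and WEATHER rows, drops the LOCATION row
```

Note that a plain substring test also treats "on" as overlapping with "London"; if that is not wanted, the comparison would need to work on whole words instead.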

Failing code: I wrote the script below using pandas to try to accomplish this, but unfortunately it does not seem to work at all.

This is my code (I apologise if it looks odd; it is the result of a great deal of googling, as I have little experience):


import pandas as pd
# load the CSV file into a pandas dataframe
df = pd.read_csv("C:/Users/.../annotations_aspectcategory_tab.csv", sep="\t")
# create a new dataframe with only the unique rows
new_df = df.drop_duplicates(subset=["Document"], keep=False)
# loop through the rows of the dataframe
for i, row in new_df.iterrows():
    # get the values of the "Document" and "Aspect" columns
    document = row["Document"]
    aspect = row["Aspect"]
    
    # get the rows with the same "Document" value as the current row
    same_document = new_df[new_df["Document"] == document]
    
    # loop through the rows with the same "Document" value
    for j, other_row in same_document.iterrows():
        # get the "Aspect" value of the other row
        other_aspect = other_row["Aspect"]
        
        # check if the "Aspect" value of the other row is contained in the "Aspect" value of the current row
        if other_aspect in aspect:
            # if it is, print the values and drop the other row from the dataframe
            print(f"Duplicate found: Document: {document}, Aspect: {aspect}, Other Aspect: {other_aspect}")
            new_df.drop(j, inplace=True)
# export the new dataframe as a CSV file
new_df.to_csv("C:/Users/.../unique_file.csv", index=False, sep="\t")
Any advice (preferably in very simple layman terms) on how to solve this problem or concrete help would be very much appreciated! Thank you in advance!
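
For reference, a hedged sketch of one way the script above could be reworked. Three things in it look likely to cause trouble: `drop_duplicates(subset=["Document"], keep=False)` removes every row whose "Document" value occurs more than once, so only documents with a single annotation survive; the inner loop also compares each row with itself, so the `in` test is always true at least once; and rows are dropped from the dataframe while it is being iterated. The version below groups by "Document" and keeps the first aspect of each overlapping set. The in-memory sample data, the third column name "Category", and the helper name `non_overlapping_index` are assumptions for illustration; the real file would be read with `pd.read_csv(..., sep="\t")` as in the original script.

```python
import io

import pandas as pd

# Small in-memory stand-in for the tab-separated file (hypothetical data
# modelled on the example in the post; "Category" is an assumed column name).
data = (
    "Document\tAspect\tCategory\n"
    "document_1\tLondon Book Fair\tEVENT\n"
    "document_1\tLondon\tLOCATION\n"
    "document_1\tweather\tWEATHER\n"
    "document_2\tLondon\tLOCATION\n"
)
df = pd.read_csv(io.StringIO(data), sep="\t")

# keep=False drops EVERY row whose Document value occurs more than once,
# so here only the single document_2 row is left.
too_aggressive = df.drop_duplicates(subset=["Document"], keep=False)


def non_overlapping_index(group):
    """Return the index labels of the rows to keep within one document.

    Assumption: two aspects overlap when one is a substring of the other.
    """
    kept_labels, kept_aspects = [], []
    for label, aspect in group["Aspect"].items():
        if not any(aspect in k or k in aspect for k in kept_aspects):
            kept_labels.append(label)
            kept_aspects.append(aspect)
    return kept_labels


# Collect the labels to keep, one document at a time, without modifying
# the dataframe while looping over it.
keep = []
for _, group in df.groupby("Document", sort=False):
    keep.extend(non_overlapping_index(group))

new_df = df.loc[keep]
print(new_df)
```

The result can then be written out with `new_df.to_csv(..., index=False, sep="\t")` as before. The "London" row of document_2 is kept because the comparison only happens within one document.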