Dec-12-2022, 04:21 PM
Hello,
Firstly, I wish to mention and stress that I do not have a lot of experience with Python and programming in general.
The situation: I have a CSV file of annotated data to be used for ABSA and Automatic Aspect Term Extraction. The first column ("col1") contains the name of the annotated file, the second column ("col2") contains the tagged aspect, and the third column ("col3") contains the category tag given to the aspect. The columns are separated by tabs (as an aspect sometimes contains a comma).
Rather frequently, the same target words are tagged as part of different aspects, or two aspects at least partially overlap.
Example:
sentence = "It's time for the annual London Book Fair, lovely weather btw"
col1|col2|col3
document_1|London Book Fair|EVENT
document_1|London|LOCATION
document_1|weather|WEATHER
To train the system for Automatic Aspect Term Extraction on the annotated data, there can be no overlap, so only one of the first two rows should remain; it does not matter which one.
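For the example above, the desired result would therefore be (keeping "London Book Fair" rather than "London", though keeping "London" instead would be equally fine):
col1|col2|col3
document_1|London Book Fair|EVENT
document_1|weather|WEATHER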
Goal: I wish to remove the rows where the value in the first column is an exact duplicate of another row's and, at the same time, the value in the second column is a partial duplicate (a substring of the other row's aspect); the third column can be ignored.
Failing code: this is the script I wrote using pandas to try to accomplish this, but unfortunately it does not seem to work at all.
I apologise if the code is weird; it is the result of massive amounts of googling, as I have little experience:
import pandas as pd

# load the CSV file into a pandas dataframe
df = pd.read_csv("C:/Users/.../annotations_aspectcategory_tab.csv", sep="\t")

# create a new dataframe with only the unique rows
new_df = df.drop_duplicates(subset=["Document"], keep=False)

# loop through the rows of the dataframe
for i, row in new_df.iterrows():
    # get the values of the "Document" and "Aspect" columns
    document = row["Document"]
    aspect = row["Aspect"]
    # get the rows with the same "Document" value as the current row
    same_document = new_df[new_df["Document"] == document]
    # loop through the rows with the same "Document" value
    for j, other_row in same_document.iterrows():
        # get the "Aspect" value of the other row
        other_aspect = other_row["Aspect"]
        # check if the "Aspect" value of the other row is contained in the "Aspect" value of the current row
        if other_aspect in aspect:
            # if it is, print the values and drop the other row from the dataframe
            print(f"Duplicate found: Document: {document}, Aspect: {aspect}, Other Aspect: {other_aspect}")
            new_df.drop(j, inplace=True)

# export the new dataframe as a CSV file
new_df.to_csv("C:/Users/.../unique_file.csv", index=False, sep="\t")

Any advice (preferably in very simple layman terms) on how to solve this problem, or concrete help, would be very much appreciated! Thank you in advance!
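To make the goal as concrete as possible, here is a small sketch of the filtering I am after, built on the example rows from above (the Document/Aspect/Category column names are taken from my script; the logic is only how I imagine it should work, not code I claim is correct):

import pandas as pd

# the example data from above, in the shape of my real file
df = pd.DataFrame(
    {
        "Document": ["document_1", "document_1", "document_1"],
        "Aspect": ["London Book Fair", "London", "weather"],
        "Category": ["EVENT", "LOCATION", "WEATHER"],
    }
)

keep = []  # indices of the rows that should survive
for _, group in df.groupby("Document"):
    aspects = group["Aspect"].tolist()
    for idx, aspect in group["Aspect"].items():
        # drop a row if its aspect is contained in a *different* aspect
        # of the same document, e.g. "London" inside "London Book Fair"
        contained = any(aspect != other and aspect in other for other in aspects)
        if not contained:
            keep.append(idx)

cleaned = df.loc[keep]
print(cleaned)  # should keep the "London Book Fair" and "weather" rows

(Exact duplicates of the same aspect within one document would both survive this sketch; I assume a df.drop_duplicates() beforehand would take care of those.)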