Dec-12-2022, 04:21 PM
Hello,
Firstly, I wish to mention and stress that I do not have a lot of experience with Python and programming in general.
The situation: I have a CSV file of annotated data to be used for ABSA and Automatic Aspect Term Extraction. The first column ("col1") contains the name of the annotated file, the second column ("col2") contains the tagged aspect, and the third column ("col3") contains the category tag given to the aspect. The columns are separated by tabs (as an aspect sometimes contains a comma).
Rather frequently, the same target words are tagged as part of different aspects, or two aspects at least partially overlap.
Example:
sentence = "It's time for the annual London Book Fair, lovely weather btw"
col1|col2|col3
document_1|London Book Fair|EVENT
document_1|London|LOCATION
document_1|weather|WEATHER
To train the system for Automatic Aspect Term Extraction on the annotated data, there can be no overlap, so only one of the first two rows should remain; it does not matter which one.
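For the example above, the desired result would therefore be (keeping "London Book Fair" rather than "London", though keeping "London" instead would be equally fine):
col1|col2|col3
document_1|London Book Fair|EVENT
document_1|weather|WEATHER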
Goal: I wish to remove the rows where the value in the first column is an exact duplicate of another row's and, at the same time, the value in the second column is a partial duplicate (a substring of the other row's aspect); the third column can be ignored.
Failing code: this is the script I wrote using pandas to try to accomplish this, but unfortunately it does not seem to work at all.
I apologise if the code is weird; it is the result of massive amounts of googling, as I have little experience:
import pandas as pd

# load the CSV file into a pandas dataframe
df = pd.read_csv("C:/Users/.../annotations_aspectcategory_tab.csv", sep="\t")

# create a new dataframe with only the unique rows
new_df = df.drop_duplicates(subset=["Document"], keep=False)

# loop through the rows of the dataframe
for i, row in new_df.iterrows():
    # get the values of the "Document" and "Aspect" columns
    document = row["Document"]
    aspect = row["Aspect"]
    # get the rows with the same "Document" value as the current row
    same_document = new_df[new_df["Document"] == document]
    # loop through the rows with the same "Document" value
    for j, other_row in same_document.iterrows():
        # get the "Aspect" value of the other row
        other_aspect = other_row["Aspect"]
        # check if the "Aspect" value of the other row is contained in the "Aspect" value of the current row
        if other_aspect in aspect:
            # if it is, print the values and drop the other row from the dataframe
            print(f"Duplicate found: Document: {document}, Aspect: {aspect}, Other Aspect: {other_aspect}")
            new_df.drop(j, inplace=True)

# export the new dataframe as a CSV file
new_df.to_csv("C:/Users/.../unique_file.csv", index=False, sep="\t")

Any advice (preferably in very simple layman terms) on how to solve this problem, or concrete help, would be very much appreciated! Thank you in advance!
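To make the goal as concrete as possible, here is a small sketch of the filtering I am after, built on the example rows from above (the Document/Aspect/Category column names are taken from my script; the logic is only how I imagine it should work, not code I claim is correct):

import pandas as pd

# the example data from above, in the shape of my real file
df = pd.DataFrame(
    {
        "Document": ["document_1", "document_1", "document_1"],
        "Aspect": ["London Book Fair", "London", "weather"],
        "Category": ["EVENT", "LOCATION", "WEATHER"],
    }
)

keep = []  # indices of the rows that should survive
for _, group in df.groupby("Document"):
    aspects = group["Aspect"].tolist()
    for idx, aspect in group["Aspect"].items():
        # drop a row if its aspect is contained in a *different* aspect
        # of the same document, e.g. "London" inside "London Book Fair"
        contained = any(aspect != other and aspect in other for other in aspects)
        if not contained:
            keep.append(idx)

cleaned = df.loc[keep]
print(cleaned)  # should keep the "London Book Fair" and "weather" rows

(Exact duplicates of the same aspect within one document would both survive this sketch; I assume a df.drop_duplicates() beforehand would take care of those.)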