Python Forum
Difficulty in adapting duplicates-filter in script
#1
Hello everyone,

I have a script to parse and extract annotated data from xmi-files. I did not write this script myself (the person who did is not reachable), I have no coding/programming background, and I really need a solution asap. I've tried adapting the script with the help of ChatGPT's code interpreter, but I've gotten stuck.

I annotated data on three levels: aspects, polarity triggers and named entities. The problem is related to the extraction of the aspects. I've just noticed that the script only extracts part of the aspects and dismisses what it deems to be "duplicates" based on the target word (group) and (I assume) the aspect label. However, repetition is common in my data, so a lot of aspects disappeared, meaning that my data is incomplete. Ideally, there would thus be an additional marker to check whether something is actually a duplicate or not.

There are four scripts; I will try to include them in full, along with an example of one of the xmi-files:
  • parsexml_may2021_cvh3.py (main script)
  • AnnotationsObject.py
  • Casobject.py
  • Documentobject.py

The script is pretty long, so I'll only paste the most relevant part of the code related to the aspect extraction here:

def write_annotations_aspectcategory(allAnnotationsDict, annotatornames, outfilefolder):
	allaspectsdict = get_all_aspects(allAnnotationsDict)
	aspecttermslist =[]
	'''
	Gets, for every unique aspect (allaspectslist), its subcategory given by each annotator
	'''
	with open(os.path.join(outfilefolder,'annotations_aspectcategory.csv'), 'w') as csvfile:
		outfilewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
		outfilewriter.writerow(['Document', 'Aspect'] + [a for a in annotatornames] + ['Sentence'])
		for documentname, annotatordict in allAnnotationsDict.items():
			#Get a list of all aspects per document, consider the first annotator in the dict, as the aspects are the same for each annotator.
			allaspects = allaspectsdict[documentname][annotatornames[0]]
			allaspectstrings = ['+'.join(x.aspecttext) for x in allaspects] #Aspects can be stored as lists of two non-consecutive spans
			allaspectsentences = [x.sentence for x in allaspects]
			# allaspectsentenceswithoutaspect = [x.sentencewithoutaspect for x in allaspects]
			seen = []
			for aspectText, aspectSent in zip(allaspectstrings, allaspectsentences):
				aspecttermslist.append(aspectText)
				rowelements = [] #Rowelements will be written to output file
				rowelements.append(documentname)
				rowelements.append(aspectText) #rowelements.append(aspectText.replace('\xad', ''))
				for a in annotatornames:
					cat = []
					aspectobjectslist = allaspectsdict[documentname][a]
					for aspobj in aspectobjectslist:
						found = False
						if '+'.join(aspobj.aspecttext) == aspectText and aspobj.sentence == aspectSent:
							if aspobj.category == None:
								cat.append('None')
							else:
								cat.append(aspobj.category)
							found = True
						else:
							continue
						if not found:
							print('Warning: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
							warnings.append('Write annotations for aspect polarities: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
					if len(cat) > 1 and 'None' in cat: #Aspect polarity can be 'None' if the Aspect is part of a linked aspect span and its polarity was added to the other part
						indx_none = cat.index('None') #Remove 'None' values
						cat.pop(indx_none)
					rowelements.append(','.join(list(set(cat))))
				rowelements.append(aspectSent) #rowelements.append(aspectSent.replace('\xad', ''))
				# rowelements.append(tuple([x.replace('\xad', '') for x in aspectSentWith]))
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
	with open(os.path.join(outfilefolder, 'aspectTerms.txt'), 'w', encoding = 'utf-8') as f:
		for el in aspecttermslist:
			f.write(el + '\n')
This appears to be the part which specifically removes the duplicates:
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
To make it a bit more concrete (and offer a way to check): for the xmi-file I've included, the script only extracts 13 aspects with the label "CONTENDER_General", but there are supposed to be 53. Completely removing the duplicates filter is not a solution, because I've noticed that there are actual duplicates as well and I cannot filter them out because it's impossible to discern which one is truly repeated or is an actual duplicate. It would mean a lot if someone would be willing to help!

If you take a look at the xmi-file, it shows that each aspect has a unique ID as well as a begin and end position:

    <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
    <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
In theory it should thus be possible to let it dismiss only those doubles which also share either the same ID or the same span. However, I do not have the coding skill (I'm already happy if I understand what a script does) and it seems to be too complicated for ChatGPT (it was able to help me remove the duplicate filter entirely, but that did not solve the problem).
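
Based on my limited understanding, the check would have to compare something like the following. This is only a sketch of the idea, not working code: it assumes the parser were extended so that each aspect object also stores the begin/end offsets from the xmi-file (say, as a hypothetical aspectspan attribute, which does not exist in the current scripts):
seen = set()
kept = []
for aspobj in allaspects:
	# Hypothetical: aspobj.aspectspan would hold the (begin, end) offsets
	# copied from the xmi-file when the aspect is parsed.
	key = ('+'.join(aspobj.aspecttext), aspobj.sentence, aspobj.aspectspan)
	if key not in seen:  # same text at the same position = a real duplicate
		seen.add(key)
		kept.append(aspobj)  # repeated word(group)s elsewhere survive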

In order to run the script on the file, all the scripts need to be saved in the same folder, with a subfolder named "annotation", which in turn contains another subfolder whose name ends in "_xmi" and which holds the xmi-file (apologies for this, the forum won't let me simply upload the ready-to-use folder). Create another subfolder for the output ("OUTPUT"). Then run this command in the main folder in PowerShell:
python parsexml_may2021_cvh3.py annotation/ lore no OUTPUT
I would be very grateful if someone would be willing to take a look and help, thank you in advance!

PS: apparently I cannot attach an xmi-file, and it is too long to include here (ca. 5000 lines) or to upload as a csv. Not sure how to solve this problem. My apologies for the inconvenience!

Edit: added an attachment

Attached Files

.py   Documentobject.py (Size: 182 bytes / Downloads: 72)
.py   parsexml_may2021_cvh3.py (Size: 33.68 KB / Downloads: 96)
.py   AnnotationsObject.py (Size: 974 bytes / Downloads: 100)
.py   Casobject.py (Size: 95 bytes / Downloads: 80)
.txt   lore.txt (Size: 19.74 KB / Downloads: 86)
#2
Obviously this statement is a lie.
Quote:I've noticed that there are actual duplicates as well and I cannot filter them out because it's impossible to discern which one is truly repeated or is an actual duplicate
If you've noticed there are "actual duplicates", there must be some way to discern which is truly repeated and which is an actual duplicate. How did you identify the duplicates? Describing the difference between a duplicate and a repeat is the first step in designing a better filter.

You say this:
Quote:In theory it should thus be possible to let it dismiss only those doubles which also share either the same ID or the same span.
What do you mean by ID and span? I don't see those in the posted code and I don't want to read the Python files. If there is an ID, the solution might be as easy as adding this ID to the tuples saved in "seen". If a span is some portion of a document, the solution might be resetting the "seen" list to empty after processing each span.
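
Untested, but something along these lines inside write_annotations_aspectcategory, assuming your aspect objects exposed that ID (here as a made-up aspectID attribute):
				# Widen the key so a row is only dropped when it repeats the
				# same annotation ID, not merely the same text and sentence.
				key = (aspectText, aspectSent, aspectID)
				if key not in seen:
					outfilewriter.writerow(rowelements)
				seen.append(key)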
#3
Hello @deanhystad, thank you for replying. The way I identified the duplicates was by manually counting them in this annotation file (53) and then looking at how many there were in the csv containing the extracted data (56). Because there were more than the actual count, the extras must be duplicates. Of course, manually counting them in every file is not an option; it was just a way to verify whether the script worked in this case.

The ID ("xmi:id") and span (consisting of "begin" and "end" combined) are included in the example of part of the xmi-file I provided:
    <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
    <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
For example, the ID of the first aspect in this short list is "55041" and its span is [1,8].

As mentioned, I do not have Python coding experience myself. How could I do this?
#4
@deanhystad I have been trying to adapt the script using ChatGPT based on what you said, and there is now an extra condition for the filter. Previously it looked at the aspect and the sentence (though I do not think the sentence check actually worked as it should have; otherwise most of these fake duplicates would not have been removed). With the help of ChatGPT I was able to add a third condition, namely the ID I mentioned:
				rowelements.append(aspectSent)
				if not (aspectText,aspectSent, aspectID) in seen:  # check if aspectID is in seen
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent, aspectID))  # add aspectID to seen
The script is able to run, but the results are exactly the same. So now I do not know whether that is because the change did not work, or because it did work but for some strange reason still removed the same aspects that should not have been removed.
#5
Open up one of the xmi files. Copy two or more annotations. Paste as text.
#6
@deanhystad Of course, here you go:
    <custom:Aspects xmi:id="1384" sofa="1" begin="5" end="47" FeatureCategory="JURY_General"/>
    <custom:Aspects xmi:id="1379" sofa="1" begin="49" end="82" FeatureCategory="META_Winner_Award-Ceremony"/>
    <custom:Aspects xmi:id="1389" sofa="1" begin="93" end="103" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="1394" sofa="1" begin="114" end="122" FeatureCategory="TEXT_General"/>
I've also found one of the xmi-files that was small enough, so I've saved it as a txt and included it. I cannot add it to this reply, so I will edit my main post and add it to the attachments there. It's called "lore.txt". That way you can see the full structure of the document if you like.

In the "AnnotationsObject.py" I added this new line:
class AspectObject:
	def __init__(self):
		self.aspecttext = []
		self.polarity = None  
		self.aspectcategory = None #--> not necessary
		self.sameclause = False
		self.sentence = None
		self.sentencerange = None
		self.sentencewithoutaspect = None #Sentence from which the aspect span itself is withdrawn
		self.aspectID = None  # NEW LINE THAT WAS ADDED
The complete function now looks like this:
def write_annotations_aspectcategory(allAnnotationsDict, annotatornames, outfilefolder):
	allaspectsdict = get_all_aspects(allAnnotationsDict)
	aspecttermslist =[]
	with open(os.path.join(outfilefolder,'annotations_aspectcategory.csv'), 'w') as csvfile:
		outfilewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
		outfilewriter.writerow(['Document', 'Aspect'] + [a for a in annotatornames] + ['Sentence'])
		for documentname, annotatordict in allAnnotationsDict.items():
			allaspects = allaspectsdict[documentname][annotatornames[0]]
			allaspectstrings = ['+'.join(x.aspecttext) for x in allaspects]
			allaspectsentences = [x.sentence for x in allaspects]
			seen = []
			for aspectObj, aspectText, aspectSent in zip(allaspects, allaspectstrings, allaspectsentences):
				aspecttermslist.append(aspectText)
				rowelements = []
				rowelements.append(documentname)
				rowelements.append(aspectText)
				aspectID = aspectObj.aspectID  # NEW LINE
				# Print debugging output
				print(f"Processing aspect: text={aspectText}, sent={aspectSent}, id={aspectID}")
				for a in annotatornames:
					cat = []
					aspectobjectslist = allaspectsdict[documentname][a]
					for aspobj in aspectobjectslist:
						found = False
						if '+'.join(aspobj.aspecttext) == aspectText and aspobj.sentence == aspectSent and aspobj.aspectID == aspectID:  # compare aspectID
							if aspobj.category == None:
								cat.append('None')
							else:
								cat.append(aspobj.category)
							found = True
						else:
							continue
						if not found:
							print('Warning: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
							warnings.append('Write annotations for aspect polarities: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
					if len(cat) > 1 and 'None' in cat: 
						indx_none = cat.index('None')
						cat.pop(indx_none)
					rowelements.append(','.join(list(set(cat))))
				rowelements.append(aspectSent)
				if not (aspectText,aspectSent, aspectID) in seen:  # check if aspectID is in seen
					outfilewriter.writerow(rowelements)
					print(f"New aspect, writing to output: text={aspectText}, sent={aspectSent}, id={aspectID}")
				else:
					print(f"Duplicate aspect, not writing to output: text={aspectText}, sent={aspectSent}, id={aspectID}")
				seen.append((aspectText,aspectSent, aspectID))  # ADD ASPECTID TO BE SEEN
				print(f"Current contents of seen: {seen}")
	with open(os.path.join(outfilefolder, 'aspectTerms.txt'), 'w', encoding = 'utf-8') as f:
		for el in aspecttermslist:
			f.write(el + '\n')
The output given by the print statements shows a "None" value for the xmi:id, so my guess is that it is simply not getting extracted. I've tried using ChatGPT's code interpreter to find out how I could solve this problem and get it actually extracted, but it keeps looping and inventing completely new parts of the script that do not actually exist.
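
To illustrate what I mean by "extracted": standalone, reading those attributes from the file looks something like this (just an illustration, not part of the real parser; I do not know where exactly parsexml_may2021_cvh3.py reads the <custom:Aspects> elements, but somewhere in there it would presumably also have to copy xmi:id onto the new aspectID field):
import xml.etree.ElementTree as ET

# The xmi:id attribute lives in the XMI namespace; this URI is the usual
# one, but check the xmlns:xmi declaration at the top of the file.
XMI_ID = '{http://www.omg.org/XMI}id'

tree = ET.parse('lore.txt')  # the attached xmi-file, saved as txt
for elem in tree.getroot().iter():
	# custom:Aspects carries a namespace prefix, so match on the local name
	if elem.tag.endswith('}Aspects') or elem.tag == 'Aspects':
		print(elem.get(XMI_ID), elem.get('begin'), elem.get('end'),
			elem.get('FeatureCategory'))
If someone could point me to the right place in the parser, I could try to add the equivalent line there.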