Python Forum
Difficulty in adapting duplicates-filter in script
#1
Hello everyone,

I have a script to parse and extract annotated data from XMI files. I did not write this script myself (the person who did is not reachable), I have no coding/programming background, and I really need a solution as soon as possible. I've tried adapting the script with ChatGPT's code interpreter, but I've gotten stuck.

I annotated data on three levels: aspects, polarity triggers and named entities. The problem is related to the extraction of the aspects. I've just noticed that the script only extracts part of the aspects and dismisses what it deems to be "duplicates" based on the target word (group) and (I assume) the aspect label. However, repetition is common in my data, so a lot of aspects disappeared, meaning that my data is incomplete. Ideally, there would thus be an additional marker to check whether an aspect is actually a duplicate or not.

There are four scripts; I will try to include them in full, along with an example of one of the XMI files:
  • parsexml_may2021_cvh3.py (main script)
  • AnnotationsObject.py
  • Casobject.py
  • Documentobject.py

The script is pretty long, so I'll only paste the most relevant part of the code related to the aspect extraction here:

def write_annotations_aspectcategory(allAnnotationsDict, annotatornames, outfilefolder):
	allaspectsdict = get_all_aspects(allAnnotationsDict)
	aspecttermslist =[]
	'''
	Gets, for every unique aspect (allaspectslist), its subcategory given by each annotator
	'''
	with open(os.path.join(outfilefolder,'annotations_aspectcategory.csv'), 'w') as csvfile:
		outfilewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
		outfilewriter.writerow(['Document', 'Aspect'] + [a for a in annotatornames] + ['Sentence'])
		for documentname, annotatordict in allAnnotationsDict.items():
			#Get a list of all aspects per document, consider the first annotator in the dict, as the aspects are the same for each annotator.
			allaspects = allaspectsdict[documentname][annotatornames[0]]
			allaspectstrings = ['+'.join(x.aspecttext) for x in allaspects] #Aspects can be stored as lists of two non-consecutive spans
			allaspectsentences = [x.sentence for x in allaspects]
			# allaspectsentenceswithoutaspect = [x.sentencewithoutaspect for x in allaspects]
			seen = []
			for aspectText, aspectSent in zip(allaspectstrings, allaspectsentences):
				aspecttermslist.append(aspectText)
				rowelements = [] #Rowelements will be written to output file
				rowelements.append(documentname)
				rowelements.append(aspectText) #rowelements.append(aspectText.replace('\xad', ''))
				for a in annotatornames:
					cat = []
					aspectobjectslist = allaspectsdict[documentname][a]
					found = False
					for aspobj in aspectobjectslist:
						if '+'.join(aspobj.aspecttext) == aspectText and aspobj.sentence == aspectSent:
							if aspobj.category == None:
								cat.append('None')
							else:
								cat.append(aspobj.category)
							found = True
					if not found:
						print('Warning: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspectText, documentname, a))
						warnings.append('Write annotations for aspect polarities: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspectText, documentname, a))
					if len(cat) > 1 and 'None' in cat: #Aspect polarity can be 'None' if the Aspect is part of a linked aspect span and its polarity was added to the other part
						indx_none = cat.index('None') #Remove 'None' polarities
						cat.pop(indx_none)
					rowelements.append(','.join(list(set(cat))))
				rowelements.append(aspectSent) #rowelements.append(aspectSent.replace('\xad', ''))
				# rowelements.append(tuple([x.replace('\xad', '') for x in aspectSentWith]))
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
	with open(os.path.join(outfilefolder, 'aspectTerms.txt'), 'w', encoding = 'utf-8') as f:
		for el in aspecttermslist:
			f.write(el + '\n')
This appears to be the part that specifically removes the duplicates:
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
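Since each annotation in the XMI carries its own begin/end offsets, one way to keep legitimate repetitions would be to include that span in the deduplication key, so only annotations with identical text, sentence AND position are dropped. Below is only a minimal sketch of the idea; it assumes the aspect objects expose integer `begin` and `end` attributes, which may not match the actual field names in AnnotationsObject.py:

```python
# Sketch: deduplicate on (text, sentence, span) instead of (text, sentence).
# The begin/end offsets are an assumption; check AnnotationsObject.py for the
# real attribute names before wiring this into the script.
seen = set()

def is_new(aspect_text, aspect_sent, begin, end):
    """Return True the first time this exact annotation (same span) appears."""
    key = (aspect_text, aspect_sent, begin, end)
    if key in seen:
        return False   # same text, same sentence AND same span: a true duplicate
    seen.add(key)
    return True        # repeated wording at a different span is kept
```

In write_annotations_aspectcategory this check would take the place of the `if not (aspectText,aspectSent) in seen` test, so a sentence that genuinely mentions the same target twice is no longer discarded.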
To make it a bit more concrete (and to offer a way to check): for the XMI file I've included, the script only extracts 13 aspects with the label "CONTENDER_General", but there should be 53. Completely removing the duplicates filter is not a solution either, because there are some actual duplicates as well, and without the filter it is impossible to discern which entries are genuine repetitions and which are true duplicates. It would mean a lot if someone would be willing to help!

If you take a look at the XMI file, you can see that each aspect has a unique ID as well as a begin and end position:

    <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
    <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
In theory it should thus be possible to dismiss only those doubles which also share the same ID or the same span. However, I do not have the coding skills (I'm already happy if I understand what a script does), and it seems to be too complicated for ChatGPT (it was able to help me remove the duplicates filter entirely, but that did not solve the problem).
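To illustrate that the spans alone are enough to tell repetitions and true duplicates apart, here is a small self-contained example that parses a snippet like the one above with the standard library. The namespace URIs are placeholders (the real file declares its own xmlns:custom and xmlns:xmi, which would need to be used instead):

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document with placeholder namespace URIs.
XMI = """<root xmlns:custom="http://example/custom" xmlns:xmi="http://example/xmi">
  <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
  <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
  <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
  <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
</root>"""

root = ET.fromstring(XMI)
ns = {"custom": "http://example/custom", "xmi": "http://example/xmi"}

# Every annotation keeps its own character span, so duplicates can be
# detected by (begin, end) rather than by surface text alone.
spans = [(a.get("begin"), a.get("end"), a.get("FeatureCategory"))
         for a in root.findall("custom:Aspects", ns)]
unique_spans = set(spans)
print(len(spans), len(unique_spans))  # prints: 4 4
```

All four annotations survive here because their spans differ, even though three of them share the label CONTENDER_General.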

In order to run the script on the file, all the scripts need to be saved in the same folder, with a subfolder named "annotation" containing another subfolder whose name ends in "_xmi", which holds the XMI file (apologies for this; the forum won't let me simply upload a ready-to-use folder). Create another subfolder for the output ("OUTPUT"). Then run this command from the main folder in PowerShell:
python parsexml_may2021_cvh3.py annotation/ lore no OUTPUT
I would be very grateful if someone would be willing to take a look and help, thank you in advance!





PS: apparently I cannot attach an XMI file, and it is too long (ca. 5,000 lines) to include here or to upload as a CSV. Not sure how to solve this problem. My apologies for the inconvenience!



Edit: added an attachment

Attached Files

.py   Documentobject.py (Size: 182 bytes / Downloads: 81)
.py   parsexml_may2021_cvh3.py (Size: 33.68 KB / Downloads: 107)
.py   AnnotationsObject.py (Size: 974 bytes / Downloads: 109)
.py   Casobject.py (Size: 95 bytes / Downloads: 86)
.txt   lore.txt (Size: 19.74 KB / Downloads: 93)
