Python Forum
Difficulty in adapting duplicates-filter in script
#1
Hello everyone,

I have a script to parse and extract annotated data from XMI files. I did not write this script myself (the person who did is not reachable), I have no coding/programming background, and I really need a solution as soon as possible. I've tried adapting the script with ChatGPT's code interpreter, but I've gotten stuck.

I annotated data on three levels: aspects, polarity triggers and named entities. The problem is related to the extraction of the aspects. I've just noticed that the script only extracts part of the aspects and dismisses what it deems to be "duplicates" based on the target word (group) and (I assume) the aspect label. However, repetition is common in my data, so a lot of aspects disappeared, meaning that my data is incomplete. Ideally, there would thus be an additional marker to check whether an aspect is actually a duplicate or not.

There are four scripts; I will try to include them in full, along with an example of one of the XMI files:
  • parsexml_may2021_cvh3.py (main script)
  • AnnotationsObject.py
  • Casobject.py
  • Documentobject.py

The script is pretty long, so I'll only paste the most relevant part of the code related to the aspect extraction here:

def write_annotations_aspectcategory(allAnnotationsDict, annotatornames, outfilefolder):
	allaspectsdict = get_all_aspects(allAnnotationsDict)
	aspecttermslist =[]
	'''
	Gets, for every unique aspect (allaspectslist), its subcategory given by each annotator
	'''
	with open(os.path.join(outfilefolder,'annotations_aspectcategory.csv'), 'w') as csvfile:
		outfilewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
		outfilewriter.writerow(['Document', 'Aspect'] + [a for a in annotatornames] + ['Sentence'])
		for documentname, annotatordict in allAnnotationsDict.items():
			#Get a list of all aspects per document, consider the first annotator in the dict, as the aspects are the same for each annotator.
			allaspects = allaspectsdict[documentname][annotatornames[0]]
			allaspectstrings = ['+'.join(x.aspecttext) for x in allaspects] #Aspects can be stored as lists of two non-consecutive spans
			allaspectsentences = [x.sentence for x in allaspects]
			# allaspectsentenceswithoutaspect = [x.sentencewithoutaspect for x in allaspects]
			seen = []
			for aspectText, aspectSent in zip(allaspectstrings, allaspectsentences):
				aspecttermslist.append(aspectText)
				rowelements = [] #Rowelements will be written to output file
				rowelements.append(documentname)
				rowelements.append(aspectText) #rowelements.append(aspectText.replace('\xad', ''))
				for a in annotatornames:
					cat = []
					aspectobjectslist = allaspectsdict[documentname][a]
					found = False
					for aspobj in aspectobjectslist:
						if '+'.join(aspobj.aspecttext) == aspectText and aspobj.sentence == aspectSent:
							if aspobj.category == None:
								cat.append('None')
							else:
								cat.append(aspobj.category)
							found = True
					if not found:
						print('Warning: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspectText, documentname, a))
						warnings.append('Write annotations for aspect polarities: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspectText, documentname, a))
					if len(cat) > 1 and 'None' in cat: #Aspect polarity can be 'None' if the Aspect is part of a linked aspect span and its polarity was added to the other part
						indx_none = cat.index('None') #Remove 'None' polarities
						cat.pop(indx_none)
					rowelements.append(','.join(list(set(cat))))
				rowelements.append(aspectSent) #rowelements.append(aspectSent.replace('\xad', ''))
				# rowelements.append(tuple([x.replace('\xad', '') for x in aspectSentWith]))
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
	with open(os.path.join(outfilefolder, 'aspectTerms.txt'), 'w', encoding = 'utf-8') as f:
		for el in aspecttermslist:
			f.write(el + '\n')
This appears to be the part that specifically removes the duplicates:
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
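Since each annotation in the XMI carries its own begin/end offsets, one way to keep legitimate repetitions would be to include that span in the deduplication key, so only annotations with identical text, sentence AND position are dropped. Below is only a minimal sketch of the idea; it assumes the aspect objects expose integer `begin` and `end` attributes, which may not match the actual field names in AnnotationsObject.py:

```python
# Sketch: deduplicate on (text, sentence, span) instead of (text, sentence).
# The begin/end offsets are an assumption; check AnnotationsObject.py for the
# real attribute names before wiring this into the script.
seen = set()

def is_new(aspect_text, aspect_sent, begin, end):
    """Return True the first time this exact annotation (same span) appears."""
    key = (aspect_text, aspect_sent, begin, end)
    if key in seen:
        return False   # same text, same sentence AND same span: a true duplicate
    seen.add(key)
    return True        # repeated wording at a different span is kept
```

In write_annotations_aspectcategory this check would take the place of the `if not (aspectText,aspectSent) in seen` test, so a sentence that genuinely mentions the same target twice is no longer discarded.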
To make it a bit more concrete (and to offer a way to check): for the XMI file I've included, the script only extracts 13 aspects with the label "CONTENDER_General", but there should be 53. Completely removing the duplicates filter is not a solution either, because there are some actual duplicates as well, and without the filter it is impossible to discern which entries are genuine repetitions and which are true duplicates. It would mean a lot if someone would be willing to help!

If you take a look at the XMI file, you can see that each aspect has a unique ID as well as a begin and end position:

    <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
    <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
In theory it should thus be possible to dismiss only those doubles which also share the same ID or the same span. However, I do not have the coding skills (I'm already happy if I understand what a script does), and it seems to be too complicated for ChatGPT (it was able to help me remove the duplicates filter entirely, but that did not solve the problem).
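To illustrate that the spans alone are enough to tell repetitions and true duplicates apart, here is a small self-contained example that parses a snippet like the one above with the standard library. The namespace URIs are placeholders (the real file declares its own xmlns:custom and xmlns:xmi, which would need to be used instead):

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document with placeholder namespace URIs.
XMI = """<root xmlns:custom="http://example/custom" xmlns:xmi="http://example/xmi">
  <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
  <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
  <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
  <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
</root>"""

root = ET.fromstring(XMI)
ns = {"custom": "http://example/custom", "xmi": "http://example/xmi"}

# Every annotation keeps its own character span, so duplicates can be
# detected by (begin, end) rather than by surface text alone.
spans = [(a.get("begin"), a.get("end"), a.get("FeatureCategory"))
         for a in root.findall("custom:Aspects", ns)]
unique_spans = set(spans)
print(len(spans), len(unique_spans))  # prints: 4 4
```

All four annotations survive here because their spans differ, even though three of them share the label CONTENDER_General.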

In order to run the script on the file, all the scripts need to be saved in the same folder, with a subfolder named "annotation" containing another subfolder whose name ends in "_xmi", which holds the XMI file (apologies for this; the forum won't let me simply upload a ready-to-use folder). Create another subfolder for the output ("OUTPUT"). Then run this command from the main folder in PowerShell:
python parsexml_may2021_cvh3.py annotation/ lore no OUTPUT
I would be very grateful if someone would be willing to take a look and help, thank you in advance!





PS: apparently I cannot attach an XMI file, and it is too long (ca. 5,000 lines) to include here or to upload as a CSV. Not sure how to solve this problem. My apologies for the inconvenience!



Edit: added an attachment

Attached Files

.py   Documentobject.py (Size: 182 bytes / Downloads: 81)
.py   parsexml_may2021_cvh3.py (Size: 33.68 KB / Downloads: 107)
.py   AnnotationsObject.py (Size: 974 bytes / Downloads: 109)
.py   Casobject.py (Size: 95 bytes / Downloads: 86)
.txt   lore.txt (Size: 19.74 KB / Downloads: 93)
