Python Forum
Difficulty in adapting duplicates-filter in script
#1
Hello everyone,

I have a script to parse and extract annotated data from xmi-files. I did not write this script myself (the person who did is not reachable), I have no coding/programming background, and I really need a solution asap. I've tried adapting the script with the help of ChatGPT's code interpreter, but I've gotten stuck.

I annotated data on three levels: aspects, polarity triggers and named entities. The problem is related to the extraction of the aspects. I've just noticed that the script only extracts part of the aspects and dismisses what it deems to be "duplicates" based on the target word (group) and (I assume) the aspect label. However, repetition is common in my data, so a lot of aspects disappeared, meaning that my data is incomplete. Ideally, there would thus be an additional marker to check whether something is actually a duplicate or not.

There are four scripts; I will try to include them in full, along with an example of one of the xmi-files:
  • parsexml_may2021_cvh3.py (main script)
  • AnnotationsObject.py
  • Casobject.py
  • Documentobject.py

The script is pretty long, so I'll only paste the most relevant part of the code related to the aspect extraction here:

def write_annotations_aspectcategory(allAnnotationsDict, annotatornames, outfilefolder):
	allaspectsdict = get_all_aspects(allAnnotationsDict)
	aspecttermslist =[]
	'''
	Gets, for every unique aspect (allaspectslist), its subcategory given by each annotator
	'''
	with open(os.path.join(outfilefolder,'annotations_aspectcategory.csv'), 'w') as csvfile:
		outfilewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
		outfilewriter.writerow(['Document', 'Aspect'] + [a for a in annotatornames] + ['Sentence'])
		for documentname, annotatordict in allAnnotationsDict.items():
			#Get a list of all aspects per document, consider the first annotator in the dict, as the aspects are the same for each annotator.
			allaspects = allaspectsdict[documentname][annotatornames[0]]
			allaspectstrings = ['+'.join(x.aspecttext) for x in allaspects] #Aspects can be stored as lists of two non-consecutive spans
			allaspectsentences = [x.sentence for x in allaspects]
			# allaspectsentenceswithoutaspect = [x.sentencewithoutaspect for x in allaspects]
			seen = []
			for aspectText, aspectSent in zip(allaspectstrings, allaspectsentences):
				aspecttermslist.append(aspectText)
				rowelements = [] #Rowelements will be written to output file
				rowelements.append(documentname)
				rowelements.append(aspectText) #rowelements.append(aspectText.replace('\xad', ''))
				for a in annotatornames:
					cat = []
					aspectobjectslist = allaspectsdict[documentname][a]
					for aspobj in aspectobjectslist:
						found = False
						if '+'.join(aspobj.aspecttext) == aspectText and aspobj.sentence == aspectSent:
							if aspobj.category == None:
								cat.append('None')
							else:
								cat.append(aspobj.category)
							found = True
						else:
							continue
						if not found:
							print('Warning: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
							warnings.append('Write annotations for aspect polarities: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
					if len(cat) > 1 and 'None' in cat: #Aspect polarity can be 'None' if the Aspect is part of a linked aspect span and its polarity was added to the other part
						indx_none = cat.index('None') #Remove 'None' values
						cat.pop(indx_none)
					rowelements.append(','.join(list(set(cat))))
				rowelements.append(aspectSent) #rowelements.append(aspectSent.replace('\xad', ''))
				# rowelements.append(tuple([x.replace('\xad', '') for x in aspectSentWith]))
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
	with open(os.path.join(outfilefolder, 'aspectTerms.txt'), 'w', encoding = 'utf-8') as f:
		for el in aspecttermslist:
			f.write(el + '\n')
This appears to be the part which specifically removes the duplicates:
				if not (aspectText,aspectSent) in seen: #Discard 100% duplicates (i.e. aspects that are annotated twice in the sentence)
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent))
To make it a bit more concrete (and offer a way to check): for the xmi-file I've included, the script only extracts 13 aspects with the label "CONTENDER_General", but there are supposed to be 53. Completely removing the duplicates filter is not a solution, because I've noticed that there are actual duplicates as well and I cannot filter them out because it's impossible to discern which one is truly repeated or is an actual duplicate. It would mean a lot if someone would be willing to help!

If you take a look at the xmi-file, it shows that each aspect has a unique ID as well as a begin and end position:

    <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
    <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
In theory it should thus be possible to let it dismiss only those doubles which also share either the same ID or the same span. However, I do not have the coding skill (I'm already happy if I understand what a script does) and it seems to be too complicated for ChatGPT (it was able to help me remove the duplicate filter entirely, but that did not solve the problem).
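
Based on my limited understanding, the check would have to compare something like the following. This is only a sketch of the idea, not working code: it assumes the parser were extended so that each aspect object also stores the begin/end offsets from the xmi-file (say, as a hypothetical aspectspan attribute, which does not exist in the current scripts):
seen = set()
kept = []
for aspobj in allaspects:
	# Hypothetical: aspobj.aspectspan would hold the (begin, end) offsets
	# copied from the xmi-file when the aspect is parsed.
	key = ('+'.join(aspobj.aspecttext), aspobj.sentence, aspobj.aspectspan)
	if key not in seen:  # same text at the same position = a real duplicate
		seen.add(key)
		kept.append(aspobj)  # repeated word(group)s elsewhere survive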

In order to run the script on the file, all the scripts need to be saved in the same folder, with a subfolder named "annotation", which in turn contains another subfolder whose name ends in "_xmi" and which holds the xmi-file (apologies for this, the forum won't let me simply upload the ready-to-use folder). Create another subfolder for the output ("OUTPUT"). Then run this command in the main folder in PowerShell:
python parsexml_may2021_cvh3.py annotation/ lore no OUTPUT
I would be very grateful if someone would be willing to take a look and help, thank you in advance!

PS: apparently I cannot attach an xmi-file, and it is too long to include here (ca. 5000 lines) or to upload as a csv. Not sure how to solve this problem. My apologies for the inconvenience!

Edit: added an attachment

Attached Files

.py   Documentobject.py (Size: 182 bytes / Downloads: 72)
.py   parsexml_may2021_cvh3.py (Size: 33.68 KB / Downloads: 96)
.py   AnnotationsObject.py (Size: 974 bytes / Downloads: 100)
.py   Casobject.py (Size: 95 bytes / Downloads: 80)
.txt   lore.txt (Size: 19.74 KB / Downloads: 86)
#2
Obviously this statement is a lie.
Quote:I've noticed that there are actual duplicates as well and I cannot filter them out because it's impossible to discern which one is truly repeated or is an actual duplicate
If you've noticed there are "actual duplicates", there must be some way to discern which is truly repeated and which is an actual duplicate. How did you identify the duplicates? Describing the difference between a duplicate and a repeat is the first step in designing a better filter.

You say this:
Quote:In theory it should thus be possible to let it dismiss only those doubles which also share either the same ID or the same span.
What do you mean by ID and span? I don't see those in the posted code and I don't want to read the Python files. If there is an ID, the solution might be as easy as adding this ID to the tuples saved in "seen". If a span is some portion of a document, the solution might be resetting the "seen" list to empty after processing each span.
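
Untested, but something along these lines inside write_annotations_aspectcategory, assuming your aspect objects exposed that ID (here as a made-up aspectID attribute):
				# Widen the key so a row is only dropped when it repeats the
				# same annotation ID, not merely the same text and sentence.
				key = (aspectText, aspectSent, aspectID)
				if key not in seen:
					outfilewriter.writerow(rowelements)
				seen.append(key)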
#3
Hello @deanhystad, thank you for replying. The way I identified the duplicates was by manually counting them in this annotation file (53) and then looking at how many there were in the csv containing the extracted data (56). Because there were more than the actual count, the extras must be duplicates. Of course, manually counting them in every file is not an option; it was just a way to verify whether the script worked in this case.

The ID ("xmi:id") and span (consisting of "begin" and "end" combined) are included in the example of part of the xmi-file I provided:
    <custom:Aspects xmi:id="55041" sofa="1" begin="1" end="8" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="58618" sofa="1" begin="38" end="54" FeatureCategory="ONSITE-AUDIENCE_General"/>
    <custom:Aspects xmi:id="55046" sofa="1" begin="90" end="93" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="55051" sofa="1" begin="102" end="109" FeatureCategory="CONTENDER_General"/>
For example, the ID of the first aspect in this short list is "55041" and its span is [1,8].

As mentioned, I do not have Python coding experience myself. How could I do this?
#4
@deanhystad I have been trying to adapt the script using ChatGPT based on what you said, and there is now an extra condition for the filter. Previously it looked at the aspect and the sentence (though I do not think the sentence check actually worked as it should have; otherwise most of these fake duplicates would not have been removed). With the help of ChatGPT I was able to add a third condition, namely the ID I mentioned:
				rowelements.append(aspectSent)
				if not (aspectText,aspectSent, aspectID) in seen:  # check if aspectID is in seen
					outfilewriter.writerow(rowelements)
				seen.append((aspectText,aspectSent, aspectID))  # add aspectID to seen
The script is able to run, but the results are exactly the same. So now I do not know whether that is because the change did not work, or because it did work but for some strange reason still removed the same aspects that should not have been removed.
#5
Open up one of the xmi files. Copy two or more annotations. Paste as text.
#6
@deanhystad Of course, here you go:
    <custom:Aspects xmi:id="1384" sofa="1" begin="5" end="47" FeatureCategory="JURY_General"/>
    <custom:Aspects xmi:id="1379" sofa="1" begin="49" end="82" FeatureCategory="META_Winner_Award-Ceremony"/>
    <custom:Aspects xmi:id="1389" sofa="1" begin="93" end="103" FeatureCategory="CONTENDER_General"/>
    <custom:Aspects xmi:id="1394" sofa="1" begin="114" end="122" FeatureCategory="TEXT_General"/>
I've also found one of the xmi-files that was small enough, so I've saved it as a txt and included it. I cannot add it to this reply, so I will edit my main post and add it to the attachments there. It's called "lore.txt". That way you can see the full structure of the document if you like.

In the "AnnotationsObject.py" I added this new line:
class AspectObject:
	def __init__(self):
		self.aspecttext = []
		self.polarity = None  
		self.aspectcategory = None #--> not necessary
		self.sameclause = False
		self.sentence = None
		self.sentencerange = None
		self.sentencewithoutaspect = None #Sentence from which the aspect span itself is withdrawn
		self.aspectID = None  # NEW LINE THAT WAS ADDED
The complete function now looks like this:
def write_annotations_aspectcategory(allAnnotationsDict, annotatornames, outfilefolder):
	allaspectsdict = get_all_aspects(allAnnotationsDict)
	aspecttermslist =[]
	with open(os.path.join(outfilefolder,'annotations_aspectcategory.csv'), 'w') as csvfile:
		outfilewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
		outfilewriter.writerow(['Document', 'Aspect'] + [a for a in annotatornames] + ['Sentence'])
		for documentname, annotatordict in allAnnotationsDict.items():
			allaspects = allaspectsdict[documentname][annotatornames[0]]
			allaspectstrings = ['+'.join(x.aspecttext) for x in allaspects]
			allaspectsentences = [x.sentence for x in allaspects]
			seen = []
			for aspectObj, aspectText, aspectSent in zip(allaspects, allaspectstrings, allaspectsentences):
				aspecttermslist.append(aspectText)
				rowelements = []
				rowelements.append(documentname)
				rowelements.append(aspectText)
				aspectID = aspectObj.aspectID  # NEW LINE
				# Print debugging output
				print(f"Processing aspect: text={aspectText}, sent={aspectSent}, id={aspectID}")
				for a in annotatornames:
					cat = []
					aspectobjectslist = allaspectsdict[documentname][a]
					for aspobj in aspectobjectslist:
						found = False
						if '+'.join(aspobj.aspecttext) == aspectText and aspobj.sentence == aspectSent and aspobj.aspectID == aspectID:  # compare aspectID
							if aspobj.category == None:
								cat.append('None')
							else:
								cat.append(aspobj.category)
							found = True
						else:
							continue
						if not found:
							print('Warning: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
							warnings.append('Write annotations for aspect polarities: aspect\t{0}\t{1}\tfrom annotator {2} not found in allaspectslist.'.format(aspobj.aspecttext, documentname, a))
					if len(cat) > 1 and 'None' in cat: 
						indx_none = cat.index('None')
						cat.pop(indx_none)
					rowelements.append(','.join(list(set(cat))))
				rowelements.append(aspectSent)
				if not (aspectText,aspectSent, aspectID) in seen:  # check if aspectID is in seen
					outfilewriter.writerow(rowelements)
					print(f"New aspect, writing to output: text={aspectText}, sent={aspectSent}, id={aspectID}")
				else:
					print(f"Duplicate aspect, not writing to output: text={aspectText}, sent={aspectSent}, id={aspectID}")
				seen.append((aspectText,aspectSent, aspectID))  # ADD ASPECTID TO BE SEEN
				print(f"Current contents of seen: {seen}")
	with open(os.path.join(outfilefolder, 'aspectTerms.txt'), 'w', encoding = 'utf-8') as f:
		for el in aspecttermslist:
			f.write(el + '\n')
The output given by the print statements shows a "None" value for the xmi:id, so my guess is that it is simply not getting extracted. I've tried using ChatGPT's code interpreter to find out how I could solve this problem and get it actually extracted, but it keeps looping and inventing completely new parts of the script that do not actually exist.
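
To illustrate what I mean by "extracted": standalone, reading those attributes from the file looks something like this (just an illustration, not part of the real parser; I do not know where exactly parsexml_may2021_cvh3.py reads the <custom:Aspects> elements, but somewhere in there it would presumably also have to copy xmi:id onto the new aspectID field):
import xml.etree.ElementTree as ET

# The xmi:id attribute lives in the XMI namespace; this URI is the usual
# one, but check the xmlns:xmi declaration at the top of the file.
XMI_ID = '{http://www.omg.org/XMI}id'

tree = ET.parse('lore.txt')  # the attached xmi-file, saved as txt
for elem in tree.getroot().iter():
	# custom:Aspects carries a namespace prefix, so match on the local name
	if elem.tag.endswith('}Aspects') or elem.tag == 'Aspects':
		print(elem.get(XMI_ID), elem.get('begin'), elem.get('end'),
			elem.get('FeatureCategory'))
If someone could point me to the right place in the parser, I could try to add the equivalent line there.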