Python Forum
Get data out of a CoNLL file
#1
Currently, I'm working with CoNLL files that look like the one in the attachment (saved as a .txt file, since .bz2 files cannot be uploaded).

I'd like to extract all genitives together with their heads. My current code looks like this:

import csv, bz2
from collections import Counter, deque, defaultdict

names = ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10')

filename = "test.conll.bz2"

nouns = Counter()
d = defaultdict(list)

with open(filename) as f:
   f = bz2.BZ2File(filename, "rb")
   reader = csv.DictReader(f, fieldnames=names, delimiter='\t')

   act_index = 0
   last = deque(maxlen=50)
   for tok in reader:
       last.append(tok)
       if (tok['5'] == 'NN' or tok['5'] == 'NE') and 'Gen' in tok['6']:
           dep = tok['7']
           act_index = list(last).index(tok) 
           act = tok['3']
           while last[act_index-1]['7'] != dep:
               act_index = act_index-1
           while last[act_index]['5'] != 'NN':
               act_index = act_index+1
           else:
               nouns.update(last[act_index]['3'].split())
               d[last[act_index]['3']].append(act)



if __name__ == '__main__':

   for el in sorted(nouns, key=nouns.get, reverse=True):
       Gen = ""
       for e in d[el]:
           if e not in Gen:
               Gen += e + ","
           
       Gen = Gen[:-1] 
       print el + "\t" + Gen
My code has some weaknesses. For example, I sometimes get the error "deque index out of range", and I don't know why. I also noticed that if the CoNLL file contains two similar sentences one after the other, my code does not work properly (it only adds the first occurrence). Finally, I think my code could be written more efficiently.

My main goal is to get every head together with all of its genitive forms. A desired output would be:
Geburt \t Kind
Henker \t Geschichte
Kind \t Herr,Adam

Thanks a lot for any advice or hint.

Attached Files

.txt   test.txt (Size: 12.43 KB / Downloads: 574)
#2
I think the index error on the deque comes from your while loops, which raise or lower the index without checking its bounds. If even the last item in that direction still satisfies the while condition, the loop keeps going and the index becomes too large or too small for the deque, which gives you the error. For example:

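breakfast = ['spam', 'spam', 'spam', 'eggs']
index = 0
# Nothing stops this loop, so index eventually reaches len(breakfast)
# and breakfast[index] raises IndexError.
while True:
    print(breakfast[index])
    index += 1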
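One way to avoid running off the end is to put the bounds check into the loop condition. The following is only a minimal sketch: `find_forward` is a hypothetical helper name, not something from your code, and it returns `None` when nothing matches so the caller can decide what to do.

```python
def find_forward(items, start, predicate):
    """Return the first index >= start whose item satisfies predicate, else None."""
    index = start
    # The bounds check is part of the loop condition, so index never
    # goes past the last valid position.
    while index < len(items):
        if predicate(items[index]):
            return index
        index += 1
    return None


breakfast = ['spam', 'spam', 'spam', 'eggs']
print(find_forward(breakfast, 0, lambda item: item == 'eggs'))   # 3
print(find_forward(breakfast, 0, lambda item: item == 'bacon'))  # None
```

The same pattern works on a `collections.deque`, since it supports `len()` and integer indexing. A backward scan would need the mirror-image check (`index >= 0`), because a negative index silently wraps around to the other end instead of raising an error.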
The other problem, I think, has to do with the keys of your defaultdict being the third column, which is not unique. Each key maps to a single value, so to collect several items under one key you want to append to a list rather than update:

from collections import defaultdict

factors = defaultdict(list)
for num, factor in [(2, 1), (2, 2), (3, 1), (3, 3), (4, 1), (4, 2), (4, 4)]:
    # Each append adds to the list stored under the same key.
    factors[num].append(factor)
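Applied to the genitive problem, the same grouping idea might look like the sketch below. It rests on assumptions that are not confirmed by the original post: that the file follows the CoNLL-X column layout (column 1 token ID, column 3 lemma, column 5 POS tag, column 6 features, column 7 ID of the head token) and that sentences are separated by blank lines. The function names and the sample rows are made up for illustration, and the code is written for Python 3.

```python
from collections import defaultdict


def read_sentences(lines):
    """Yield one list of rows per sentence; blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def genitive_heads(sentences):
    """Map each head lemma to the genitive lemmas that depend on it."""
    heads = defaultdict(list)
    for sentence in sentences:
        # Assumption: column 1 is the token ID, unique within a sentence.
        by_id = {row[0]: row for row in sentence}
        for row in sentence:
            if row[4] in ("NN", "NE") and "Gen" in row[5]:
                head = by_id.get(row[6])  # assumption: column 7 is the head ID
                if head is None:          # root token or malformed row
                    continue
                if row[2] not in heads[head[2]]:
                    heads[head[2]].append(row[2])
    return heads


# Hypothetical sample: the same sentence twice, to show that repeats are harmless.
sample = [
    "1\tdie\tdie\tART\tART\tNom|Sg|Fem\t2\tNK\t_\t_",
    "2\tGeburt\tGeburt\tN\tNN\tNom|Sg|Fem\t0\tROOT\t_\t_",
    "3\tdes\tder\tART\tART\tGen|Sg|Neut\t4\tNK\t_\t_",
    "4\tKindes\tKind\tN\tNN\tGen|Sg|Neut\t2\tAG\t_\t_",
    "",
] * 2

result = genitive_heads(read_sentences(sample))
for head, genitives in result.items():
    print(head + "\t" + ",".join(genitives))
```

Looking the head up by ID within one sentence removes the need for index arithmetic on a deque, so there is no index to run out of range. Membership is tested on a list of lemmas rather than on a joined string, which avoids accidental substring matches. For the real file, `bz2.open(filename, "rt")` could supply the lines; the right encoding would have to be checked against the data.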
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.