Jun-19-2017, 09:35 AM
Currently, I'm working with CoNLL files that look like the one in the attachment (saved as a .txt file, since .bz2 files cannot be uploaded).
I'd like to extract all the heads of genitives, along with the genitives themselves. My current code looks like this:
    import csv, bz2
    from collections import Counter, deque, defaultdict

    names = ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10')
    filename = "test.conll.bz2"
    nouns = Counter()
    d = defaultdict(list)

    with open(filename) as f:
        f = bz2.BZ2File(filename, "rb")
        reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
        act_index = 0
        last = deque(maxlen=50)
        for tok in reader:
            last.append(tok)
            if (tok['5'] == 'NN' or tok['5'] == 'NE') and 'Gen' in tok['6']:
                dep = tok['7']
                act_index = list(last).index(tok)
                act = tok['3']
                while last[act_index-1]['7'] != dep:
                    act_index = act_index-1
                while last[act_index]['5'] != 'NN':
                    act_index = act_index+1
                else:
                    nouns.update(last[act_index]['3'].split())
                    d[last[act_index]['3']].append(act)

    if __name__ == '__main__':
        for el in sorted(nouns, key=nouns.get, reverse=True):
            Gen = ""
            for e in d[el]:
                if e not in Gen:
                    Gen += e + ","
            Gen = Gen[:-1]
            print el + "\t" + Gen

My code has some weaknesses. For example, I sometimes get the error "deque index out of range", and I don't know why. I also noticed that if the CoNLL file contains a similar sentence twice in a row, my code does not work properly (it only adds the first occurrence). Finally, I think my code could be written more efficiently.
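The "deque index out of range" error can plausibly be reproduced in isolation. The sketch below uses a toy deque of strings (an assumption, not the real tokens) and the same kind of backward scan as the first `while` loop: if no earlier element satisfies the stop condition, the index keeps decreasing until it falls below `-len(last)`, at which point the deque raises `IndexError`.

```python
from collections import deque

# Toy stand-in for the token window; the real code stores csv rows here.
last = deque(['a', 'b', 'c'], maxlen=50)

act_index = 0
error = None
try:
    # Same shape as `while last[act_index-1]['7'] != dep`, with a stop
    # value ('zzz') that never occurs in the window.
    while last[act_index - 1] != 'zzz':
        act_index -= 1
except IndexError as exc:
    error = exc

# The loop walked through every negative index and then ran off the end.
print(act_index, type(error).__name__)
```

The forward scan in the second `while` loop could fail the same way when it runs past the newest element. Whether this is what happens on the attached file is only a guess.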
What I mainly need is every head together with all of its genitive forms. The desired output would look like this:
Geburt \t Kind
Henker \t Geschichte
Kind \t Herr,Adam
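One possible way to produce this kind of output is to look up each genitive's head by its ID within the same sentence, rather than scanning backwards through a window. This is only a minimal sketch under assumptions that are not confirmed by the attachment: the columns follow a CoNLL-X-like layout (column 1 = token ID, 3 = lemma, 5 = POS, 6 = features, 7 = head ID), and sentences are separated by blank lines. The names `genitive_heads` and `format_heads` are hypothetical, and the code is written for Python 3.

```python
from collections import defaultdict

def genitive_heads(lines):
    """Map each head lemma to the lemmas of the genitives attached to it.

    ASSUMPTION: tab-separated CoNLL-X-like columns
    (ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, ...) and blank lines
    between sentences.
    """
    heads = defaultdict(list)
    sentence = {}  # token ID -> columns, for the current sentence only

    def flush():
        # Resolve heads once the whole sentence is known.
        for cols in sentence.values():
            if cols[4] in ('NN', 'NE') and 'Gen' in cols[5]:
                head = sentence.get(cols[6])  # None if HEAD is the root
                if head is not None and cols[2] not in heads[head[2]]:
                    heads[head[2]].append(cols[2])
        sentence.clear()

    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():      # blank line ends a sentence
            flush()
            continue
        cols = line.split('\t')
        if len(cols) >= 7:
            sentence[cols[0]] = cols
    flush()                       # last sentence may lack a trailing blank
    return heads

def format_heads(heads):
    # One line per head: head lemma, a tab, comma-separated genitives.
    return [h + "\t" + ",".join(heads[h]) for h in sorted(heads)]

# Made-up sample rows, only to show the expected input shape.
sample = [
    "1\tdie\tder\tART\tART\tNom\t2\tNK\t_\t_",
    "2\tGeburt\tGeburt\tN\tNN\tNom\t0\tROOT\t_\t_",
    "3\tdes\tder\tART\tART\tGen\t4\tNK\t_\t_",
    "4\tKindes\tKind\tN\tNN\tGen\t2\tAG\t_\t_",
    "",
    "1\tKind\tKind\tN\tNN\tNom\t0\tROOT\t_\t_",
    "2\tdes\tder\tART\tART\tGen\t3\tNK\t_\t_",
    "3\tHerrn\tHerr\tN\tNN\tGen\t1\tAG\t_\t_",
]
result = format_heads(genitive_heads(sample))
print("\n".join(result))
```

For the real file, the lines could be read with `bz2.open("test.conll.bz2", "rt", encoding="utf-8")` and passed to `genitive_heads`. Unlike the original code, this sketch does not restrict heads to `NN` and sorts heads alphabetically instead of by frequency.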
Thanks a lot for any advice or hint.