Python Forum

Extract nouns out of a CoNLL-file
Hi everybody,

I've got a CoNLL-file, which looks like this:

Quote:
1 Janine Janine N NE _|Nom|Sg 2 subj _ _
2 langweilte langweilen V VVFIN 3|Sg|Past|_ 0 root _ _
3 sich sie PRO PRF 3|_|_ 2 obja _ _
4 so so ADV ADV _ 5 adv _ _
5 sehr sehr ADV ADV _ 2 adv _ _
6 . . $. $. _ 0 root _ _

1 Es es PRO PPER 3|Sg|Neut|Nom 2 subj _ _
2 war sein V VAFIN 3|Sg|Past|Ind 0 root _ _
3 ein eine ART ART Indef|_|Nom|Sg 4 det _ _
4 Morgen Morgen N NN _|Nom|Sg 2 pred _ _
5 des die ART ART Def|Masc|Gen|Sg 6 det _ _
6 Montags Montag N NN Masc|Gen|Sg 4 gmod _ _
7 und und KON KON _ 2 kon _ _
8 sie sie PRO PPER 3|Sg|Fem|Nom 9 subj _ _
9 saß sitzen V VVFIN 3|Sg|Past|Ind 7 cj _ _
10 in in PREP APPR Dat 9 pp _ _
11 der die ART ART Def|Fem|Dat|Sg 12 det _ _
12 Stunde Stunde N NN Fem|Dat|Sg 10 pn _ _
13 der die ART ART Def|Fem|Gen|Sg 14 det _ _
14 Mathematik Mathematik N NN Fem|Gen|Sg 12 gmod _ _
15 . . $. $. _ 0 root _ _

1 Sie sie PRO PPER 3|Sg|Fem|Nom 2 subj _ _
2 kniff kneifen V VVFIN 3|Sg|Past|Ind 0 root _ _
3 sich sie PRO PRF 3|_|Dat 12 objd _ _
4 in in PREP APPR Acc 12 pp _ _
5 die die ART ART Def|_|Acc|_ 6 det _ _
6 Spitze Spitz N NN _|Acc|_ 4 pn _ _
7 der die ART ART Def|Masc|Gen|Pl 8 det _ _
8 Finger Finger N NN Masc|Gen|Pl 6 gmod _ _
9 um um PREP APPR _ 12 pp _ _
10 wach wach ADV ADJD Pos| 9 pn _ _
11 zu zu PTKZU PTKZU _ 12 part _ _
12 bleiben bleiben V VVINF _ 2 obji _ _
13 . . $. $. _ 0 root _ _

1 Das die ART ART Def|Neut|_|Sg 2 det _ _
2 Haus Haus N NN Neut|_|Sg 0 root _ _
3 des die ART ART Def|Masc|Gen|Sg 4 det _ _
4 Bürgermeisters Bürgermeister N NN Masc|Gen|Sg 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Der die ART ART Def|Masc|Nom|Sg 2 det _ _
2 Anstieg Anstieg N NN Masc|Nom|Sg 0 root _ _
3 der die ART ART Def|_|Gen|Pl 4 det _ _
4 Kosten Kosten N NN _|Gen|Pl 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Der die ART ART Def|Fem|_|Sg 2 det _ _
2 Eingliederung Eingliederung N NN Fem|_|Sg 0 root _ _
3 der die ART ART Def|Masc|Gen|Pl 4 det _ _
4 Spätaussiedler Spätaussiedler N NN Masc|Gen|Pl 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Ein ein ART ART Indef|_|_|Sg 2 det _ _
2 Drittel Drittel N NN _|_|Sg 0 root _ _
3 der die ART ART Def|_|Gen|Pl 4 det _ _
4 Kosten Kosten N NN _|Gen|Pl 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Eine eine ART ART Indef|Fem|_|Sg 2 det _ _
2 Dame Dame N NN Fem|_|Sg 0 root _ _
3 eines eine ART ART Indef|_|Gen|Sg 5 det _ _
4 gewissen gewiss ADJA ADJA Pos|_|Gen|Sg|_| 5 attr _ _
5 Alters Alter N NN _|Gen|Sg 2 gmod _ _
6 . . $. $. _ 0 root _ _

1 Das die ART ART Def|Neut|_|Sg 2 det _ _
2 Glück Glück N NN Neut|_|Sg 0 root _ _
3 der die ART ART Def|Fem|Gen|Sg 4 det _ _
4 Zufriedenheit Zufriedenheit N NN Fem|Gen|Sg 2 gmod _ _
5 . . $. $. _ 0 root _ _

How can I cleanly extract all the nouns that have a genitive attribute?
I tried searching for each noun and then looking at most 3 tokens ahead: if one of them is marked "Gen", I extract the noun-Gen combination; if not, I move on to the next noun.
But my idea isn't brilliant, because it isn't very fast, and what if the "Gen" is 4 tokens further on?
So my question is: how would you solve this problem?
Thanks a lot for your help! :)
Load the file into NLTK (the Natural Language Toolkit for Python) as a new corpus; then you can use its built-in tools to extract the nouns.
Sorry, I haven't used it in quite a while, so you will need to search for the details.
I do know that it will do the job very nicely, though.
Hey Larz60+ :)

Thanks a lot for your reply! Unfortunately, I couldn't get it to work with NLTK. Instead, I simply read in the file, searched for genitive attributes, and extracted the noun preceding each one. This isn't a particularly fast approach, and it can produce some errors, but since I couldn't figure out how to do it with NLTK, I did it as described.

But if you know how it would work using NLTK (or if you know a link or something), I would still be very interested in it!

Thanks a lot,
Mat
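For comparison, here is a minimal sketch that follows the dependency links in the file instead of scanning neighbouring tokens, so the distance between the noun and its genitive attribute no longer matters. It uses plain Python rather than NLTK, and it rests on assumptions read off the sample above: column 7 is taken to be the index of the head token, column 8 the relation label, and "gmod" the label for a genitive modifier. The names Token, read_sentences and genitive_pairs are made up for this illustration.

```python
from collections import namedtuple

# Hypothetical record for the first eight columns of a row.
Token = namedtuple("Token", "idx form lemma cpos pos feats head deprel")

def read_sentences(lines):
    """Yield each sentence as a list of Tokens; blank lines separate sentences."""
    sentence = []
    for line in lines:
        cols = line.split()
        if not cols:
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(cols) < 8:
            continue  # assumption: rows with fewer columns are not token rows
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence

def genitive_pairs(sentence):
    """Return (head noun, genitive attribute) pairs found via the head column."""
    by_idx = {tok.idx: tok for tok in sentence}
    pairs = []
    for tok in sentence:
        if tok.deprel != "gmod":  # assumption: "gmod" marks a genitive modifier
            continue
        head = by_idx.get(tok.head)
        if head is not None and head.cpos == "N":
            pairs.append((head.form, tok.form))
    return pairs

# Two sentences copied from the sample in the question.
sample = """\
1 Eine eine ART ART Indef|Fem|_|Sg 2 det _ _
2 Dame Dame N NN Fem|_|Sg 0 root _ _
3 eines eine ART ART Indef|_|Gen|Sg 5 det _ _
4 gewissen gewiss ADJA ADJA Pos|_|Gen|Sg|_| 5 attr _ _
5 Alters Alter N NN _|Gen|Sg 2 gmod _ _
6 . . $. $. _ 0 root _ _

1 Das die ART ART Def|Neut|_|Sg 2 det _ _
2 Haus Haus N NN Neut|_|Sg 0 root _ _
3 des die ART ART Def|Masc|Gen|Sg 4 det _ _
4 Bürgermeisters Bürgermeister N NN Masc|Gen|Sg 2 gmod _ _
5 . . $. $. _ 0 root _ _
"""

for sent in read_sentences(sample.splitlines()):
    print(genitive_pairs(sent))
```

For a real file, the lines would come from something like open(path, encoding="utf-8") instead of the inline sample. Because each sentence is indexed once and every token is visited once, the work grows linearly with the file size, and a genitive attribute that sits several tokens after its noun (as in "Dame eines gewissen Alters") is still found.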
There are instructions on how to load your own corpus in NLTK.
Here's one: https://technaverbascripta.wordpress.com...e-toolkit/
Once that is done, the corpus can be accessed through normal NLTK methods.
This post: http://stackoverflow.com/questions/17753...python-mad
shows how to get a list of nouns from a corpus.
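If the goal is only a plain list of nouns, the tag column of the file may already be enough, without any tagger. The following is a rough sketch in plain Python, not the method from the linked post; it assumes that column 2 holds the word form and column 4 the coarse part-of-speech tag, with "N" standing for nouns as in the sample above. The function name nouns is made up for this illustration.

```python
def nouns(lines, coarse_tag="N"):
    """Collect the word forms whose coarse tag (column 4) equals coarse_tag."""
    found = []
    for line in lines:
        cols = line.split()
        # assumption: token rows have at least five whitespace-separated columns
        if len(cols) >= 5 and cols[3] == coarse_tag:
            found.append(cols[1])
    return found

# A few rows copied from the sample in the question.
sample = [
    "1 Janine Janine N NE _|Nom|Sg 2 subj _ _",
    "2 langweilte langweilen V VVFIN 3|Sg|Past|_ 0 root _ _",
    "",
    "1 Der die ART ART Def|Masc|Nom|Sg 2 det _ _",
    "2 Anstieg Anstieg N NN Masc|Nom|Sg 0 root _ _",
    "3 der die ART ART Def|_|Gen|Pl 4 det _ _",
    "4 Kosten Kosten N NN _|Gen|Pl 2 gmod _ _",
]

print(nouns(sample))
```

Filtering on the fine-grained tag in column 5 instead (for example "NN" versus "NE") would separate common nouns from proper names, if that distinction matters.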