Python Forum
Extract nouns out of a CoNLL-file
#1
Hi everybody,

I've got a CoNLL-file, which looks like this:

Quote:
1 Janine Janine N NE _|Nom|Sg 2 subj _ _
2 langweilte langweilen V VVFIN 3|Sg|Past|_ 0 root _ _
3 sich sie PRO PRF 3|_|_ 2 obja _ _
4 so so ADV ADV _ 5 adv _ _
5 sehr sehr ADV ADV _ 2 adv _ _
6 . . $. $. _ 0 root _ _

1 Es es PRO PPER 3|Sg|Neut|Nom 2 subj _ _
2 war sein V VAFIN 3|Sg|Past|Ind 0 root _ _
3 ein eine ART ART Indef|_|Nom|Sg 4 det _ _
4 Morgen Morgen N NN _|Nom|Sg 2 pred _ _
5 des die ART ART Def|Masc|Gen|Sg 6 det _ _
6 Montags Montag N NN Masc|Gen|Sg 4 gmod _ _
7 und und KON KON _ 2 kon _ _
8 sie sie PRO PPER 3|Sg|Fem|Nom 9 subj _ _
9 saß sitzen V VVFIN 3|Sg|Past|Ind 7 cj _ _
10 in in PREP APPR Dat 9 pp _ _
11 der die ART ART Def|Fem|Dat|Sg 12 det _ _
12 Stunde Stunde N NN Fem|Dat|Sg 10 pn _ _
13 der die ART ART Def|Fem|Gen|Sg 14 det _ _
14 Mathematik Mathematik N NN Fem|Gen|Sg 12 gmod _ _
15 . . $. $. _ 0 root _ _

1 Sie sie PRO PPER 3|Sg|Fem|Nom 2 subj _ _
2 kniff kneifen V VVFIN 3|Sg|Past|Ind 0 root _ _
3 sich sie PRO PRF 3|_|Dat 12 objd _ _
4 in in PREP APPR Acc 12 pp _ _
5 die die ART ART Def|_|Acc|_ 6 det _ _
6 Spitze Spitz N NN _|Acc|_ 4 pn _ _
7 der die ART ART Def|Masc|Gen|Pl 8 det _ _
8 Finger Finger N NN Masc|Gen|Pl 6 gmod _ _
9 um um PREP APPR _ 12 pp _ _
10 wach wach ADV ADJD Pos| 9 pn _ _
11 zu zu PTKZU PTKZU _ 12 part _ _
12 bleiben bleiben V VVINF _ 2 obji _ _
13 . . $. $. _ 0 root _ _

1 Das die ART ART Def|Neut|_|Sg 2 det _ _
2 Haus Haus N NN Neut|_|Sg 0 root _ _
3 des die ART ART Def|Masc|Gen|Sg 4 det _ _
4 Bürgermeisters Bürgermeister N NN Masc|Gen|Sg 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Der die ART ART Def|Masc|Nom|Sg 2 det _ _
2 Anstieg Anstieg N NN Masc|Nom|Sg 0 root _ _
3 der die ART ART Def|_|Gen|Pl 4 det _ _
4 Kosten Kosten N NN _|Gen|Pl 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Der die ART ART Def|Fem|_|Sg 2 det _ _
2 Eingliederung Eingliederung N NN Fem|_|Sg 0 root _ _
3 der die ART ART Def|Masc|Gen|Pl 4 det _ _
4 Spätaussiedler Spätaussiedler N NN Masc|Gen|Pl 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Ein ein ART ART Indef|_|_|Sg 2 det _ _
2 Drittel Drittel N NN _|_|Sg 0 root _ _
3 der die ART ART Def|_|Gen|Pl 4 det _ _
4 Kosten Kosten N NN _|Gen|Pl 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Eine eine ART ART Indef|Fem|_|Sg 2 det _ _
2 Dame Dame N NN Fem|_|Sg 0 root _ _
3 eines eine ART ART Indef|_|Gen|Sg 5 det _ _
4 gewissen gewiss ADJA ADJA Pos|_|Gen|Sg|_| 5 attr _ _
5 Alters Alter N NN _|Gen|Sg 2 gmod _ _
6 . . $. $. _ 0 root _ _

1 Das die ART ART Def|Neut|_|Sg 2 det _ _
2 Glück Glück N NN Neut|_|Sg 0 root _ _
3 der die ART ART Def|Fem|Gen|Sg 4 det _ _
4 Zufriedenheit Zufriedenheit N NN Fem|Gen|Sg 2 gmod _ _
5 . . $. $. _ 0 root _ _

How can I cleanly extract all the nouns that have a genitive attribute?
I tried searching for a noun and then looking ahead up to 3 tokens: if one of them is marked "Gen", I extract the noun-plus-genitive pair; if not, I move on to the next noun.
But my idea isn't brilliant: it isn't very fast, and what if the "Gen" is 4 tokens further on?
So my question is: how would you solve this problem?
Thanks a lot for your help! :)
#2
Load the file into NLTK (the Natural Language Toolkit for Python) as a new corpus; then you can use its built-in noun extractor.
Sorry, I haven't used it in quite a while, so you will need to search for the details.
I do know that it will do the job very nicely, though.
#3
Hey Larz60+ :)

Thanks a lot for your reply! Unfortunately, I couldn't get it working with NLTK. I simply read the file, searched for genitive attributes, and extracted the noun preceding each one. This isn't a very fast approach, and it can produce some errors... but since I couldn't figure out how to do it with NLTK, I did it as described.

But if you know how it would work using NLTK (or if you know a link or something), I would still be very interested in it!

Thanks a lot,
Mat
#4
There are instructions on how to load your own corpus in NLTK.
Here's one: https://technaverbascripta.wordpress.com...e-toolkit/
Once that is done, the corpus can be accessed through normal NLTK methods.
This post: http://stackoverflow.com/questions/17753...python-mad
shows how to get a list of nouns from a corpus.
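
For what it's worth, the sample in the first post already seems to carry the dependency information needed, so a fixed look-ahead window may not be necessary at all. Below is a minimal sketch in plain Python (no NLTK), under these assumptions read off the sample: columns are separated by whitespace, column 7 holds the ID of the token's head, column 8 holds the relation, "gmod" marks a genitive modifier, and sentences are separated by blank lines. The names read_sentences, genitive_pairs, and SAMPLE are made up for illustration.

```python
# Two sentences copied from the sample in the first post.
SAMPLE = """\
1 Das die ART ART Def|Neut|_|Sg 2 det _ _
2 Haus Haus N NN Neut|_|Sg 0 root _ _
3 des die ART ART Def|Masc|Gen|Sg 4 det _ _
4 Bürgermeisters Bürgermeister N NN Masc|Gen|Sg 2 gmod _ _
5 . . $. $. _ 0 root _ _

1 Eine eine ART ART Indef|Fem|_|Sg 2 det _ _
2 Dame Dame N NN Fem|_|Sg 0 root _ _
3 eines eine ART ART Indef|_|Gen|Sg 5 det _ _
4 gewissen gewiss ADJA ADJA Pos|_|Gen|Sg|_| 5 attr _ _
5 Alters Alter N NN _|Gen|Sg 2 gmod _ _
6 . . $. $. _ 0 root _ _
"""


def read_sentences(lines):
    """Yield each sentence as a list of token dicts (blank line = boundary)."""
    sentence = []
    for line in lines:
        cols = line.split()
        if not cols:
            if sentence:
                yield sentence
                sentence = []
            continue
        # Assumed column layout: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL _ _
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "cpos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def genitive_pairs(lines):
    """Return (head noun, genitive noun) pairs, following the HEAD column."""
    pairs = []
    for sentence in read_sentences(lines):
        by_id = {tok["id"]: tok for tok in sentence}
        for tok in sentence:
            # Assumption: "gmod" is the label for a genitive modifier.
            if tok["cpos"] == "N" and tok["deprel"] == "gmod":
                head = by_id.get(tok["head"])
                if head is not None and head["cpos"] == "N":
                    pairs.append((head["form"], tok["form"]))
    return pairs


print(genitive_pairs(SAMPLE.splitlines()))
# [('Haus', 'Bürgermeisters'), ('Dame', 'Alters')]
```

Because the pair is found through the head ID rather than by distance, it does not matter how many tokens sit between the two nouns ("Dame eines gewissen Alters" has an adjective in between). For a real file, passing the open file object instead of SAMPLE.splitlines() should work the same way, since each sentence is processed in a single pass.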