Text Processing and NLTK (POS tagging)

TwelveMoons · (This post was last modified: Mar-15-2017, 07:36 PM by Larz60+.)

Hi, I want to write a function that takes text and a POS (part of speech) as parameters and returns a sorted list of the distinct words that belong to that POS. So 'NOUN' as an argument would return all the nouns in the text. My current output is close to the desired doctest output, but not quite. If you look at my output, you can see that all the required words are there at the start of the lists. I imagine calling sorted() and set() on those elements would help fix this, but I'm not sure where to add those two calls. Does my code look right for what I'm trying to achieve, or am I going about it totally wrong? Cheers.

import nltk

def distinct_words_of_pos(text, pos):
    # Return the sorted list of distinct words with a given part of speech

    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")

    sorted_list = [[x[0].lower() for x in sorted(el) if x[1] == pos] for el in all_pos]

    return sorted_list

DOCTEST OUTPUT:

Output:
['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']

MY CURRENT OUTPUT:

Output:
 [['[', 'emma', 'jane', 'austen', ']', 'volume', 'emma', 'woodhouse', 'handsome', 'clever', 'home', 'disposition', 'blessings', 'existence', 'years', 'world'], ['daughters', 'father', 'consequence', 'sister', 'marriage', 'mistress', 'house', 'period'], ['mother', 'remembrance', 'caresses', 'place', 'woman', 'governess', 'mother', 'affection'], ['years', 'miss', 'taylor', 'mr.', 'woodhouse', 'family', 'governess', 'friend', 'fond', 'daughters', 'emma'], ['between', 'intimacy', 'sisters'], ['miss', 'taylor', 'office', 'governess', 'mildness', 'o']]

zivoni · (This post was last modified: Mar-15-2017, 11:17 PM by zivoni.)

I know almost nothing about NLTK, but I made slight changes to your code:

import nltk

def distinct_words_of_pos(text, pos):
    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")

    uniques = { x[0].lower() for el in all_pos for x in el if x[1]==pos }
    return sorted(uniques)

I guess that your test text is the first 1000 characters of Austen's Emma:

Output:
In [52]: text = nltk.corpus.gutenberg.raw("austen-emma.txt")[:1000]

In [53]: print(distinct_words_of_pos(text, "NOUN")[:10])
['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']

That gives me output that seems identical to your doctest output.

I changed just the last two lines. Your all_pos is a list of lists of tuples: for each sentence there is a list of (word, pos) tuples. You need to "flatten" it before sorting and deduplicating; flattening can be done by modifying your list comprehension so that it iterates over the sentences and then over the tuples in each sentence. Using a set comprehension instead of a list comprehension removes the duplicates. After that it's just sorting (which also converts the set to a list).
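
To see the flattening and deduplication steps without running NLTK, here is a minimal sketch on a small hand-tagged sample. The tagged_sents data and the words_with_tag helper name are made up for illustration; they only mimic the list-of-lists-of-tuples shape described above.

```python
# Hand-written stand-in for tagged sentences:
# one inner list per sentence, each holding (word, tag) tuples.
tagged_sents = [
    [("Emma", "NOUN"), ("was", "VERB"), ("clever", "ADJ")],
    [("The", "DET"), ("daughters", "NOUN"), ("loved", "VERB"), ("Emma", "NOUN")],
]

def words_with_tag(tagged_sents, pos):
    # Two "for" clauses walk the sentences, then the tuples in each sentence,
    # so the result is flat. Curly braces make it a set comprehension,
    # which drops duplicates.
    uniques = {word.lower() for sent in tagged_sents for word, tag in sent if tag == pos}
    # sorted() accepts any iterable and always returns a list.
    return sorted(uniques)

nouns = words_with_tag(tagged_sents, "NOUN")
print(nouns)  # ['daughters', 'emma']
```

The order of the for clauses matches the order of the equivalent nested loops: the outer loop (sentences) is written first, the inner loop (tuples) second.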

TwelveMoons · Mar-16-2017, 02:53 AM

Zivoni, you are a legend, your changes produce exactly what I was after. So using the {} brackets is another way of producing a set without calling set(). I did not know that. Whenever I see curly brackets I think dictionary straight away. So by having the whole expression enclosed in a single pair of brackets it "flattens" the result into one set instead of a list of lists of tuples. I'll have to look deeper into this, but that's awesome, thank you.
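
On the curly-bracket point: braces build a dictionary only when each element is a key: value pair; without the colon they build a set. The flattening itself comes from the two for clauses rather than from the brackets, as this small sketch with made-up data suggests.

```python
nested = [["b", "a"], ["a", "c"]]  # made-up sample data

# Square brackets with two "for" clauses: flat list, duplicates kept.
flat_list = [x for inner in nested for x in inner]

# Curly braces, no colon: flat set, duplicates removed.
flat_set = {x for inner in nested for x in inner}

# Curly braces with "key: value": a dict comprehension.
lengths = {x: len(x) for inner in nested for x in inner}

# Empty braces are an empty dict; an empty set is written set().
empty = {}

print(flat_list)         # ['b', 'a', 'a', 'c']
print(sorted(flat_set))  # ['a', 'b', 'c']
print(type(lengths).__name__, type(empty).__name__)  # dict dict
```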
