Python Forum

Full Version: Text Processing and NLTK (POS tagging)
Hi, I want to write a function that takes a text and a POS (part of speech) tag as parameters and returns a sorted list of the distinct words in the text that belong to that POS. So 'NOUN' as an argument would return all the nouns in the text. My current output is sort of close to the desired doctest output, but obviously not quite. If you have a look at my output, you can see that all the required words are there at the start of the lists. I would imagine that calling sorted() and set() on those elements would help fix this, but I'm not sure where to add those two calls. Does my code look right for what I'm trying to achieve, or am I going about it totally wrong? Cheers.


import nltk

def distinct_words_of_pos(text, pos):
    # Return the sorted list of distinct words with a given part of speech

    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")

    sorted_list = [ [x[0].lower() for x in sorted(el) if x[1] == pos] for el in all_pos]

    return sorted_list
DOCTEST OUTPUT:
Output:
['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']
MY CURRENT OUTPUT:
Output:
 [['[', 'emma', 'jane', 'austen', ']', 'volume', 'emma', 'woodhouse', 'handsome', 'clever', 'home', 'disposition', 'blessings', 'existence', 'years', 'world'], ['daughters', 'father', 'consequence', 'sister', 'marriage', 'mistress', 'house', 'period'], ['mother', 'remembrance', 'caresses', 'place', 'woman', 'governess', 'mother', 'affection'], ['years', 'miss', 'taylor', 'mr.', 'woodhouse', 'family', 'governess', 'friend', 'fond', 'daughters', 'emma'], ['between', 'intimacy', 'sisters'], ['miss', 'taylor', 'office', 'governess', 'mildness', 'o']]
I know almost nothing about NLTK, but I made a few small changes to your code:

import nltk

def distinct_words_of_pos(text, pos):
    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")

    uniques = { x[0].lower() for el in all_pos for x in el if x[1]==pos }
    return sorted(uniques)
I guess that your test text is the first 1000 characters of Austen's Emma:
Output:
In [52]: text = nltk.corpus.gutenberg.raw("austen-emma.txt")[:1000]
In [53]: print(distinct_words_of_pos(text, "NOUN")[:10])
['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']
That gives me output that seems identical to your doctest output.

I have changed just the last two lines. Your all_pos is a list of lists of tuples: for each sentence there is a list of (word, pos) tuples. You need to "flatten" it before sorting and deduplicating, and the flattening can be done by modifying your list comprehension so that it loops over the sentences and then over the tuples inside a single comprehension. Using a set comprehension instead of a list comprehension removes the duplicates. After that it's just sorting (sorted() also converts the set to a list).
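To see the flattening and deduplication without installing NLTK, here is a minimal sketch on hand-made data. The tagged_sents list and the helper name words_with_tag are made up for illustration; they only imitate the shape of what nltk.pos_tag_sents returns and are not real tagger output.

```python
# Hand-made stand-in for the result of nltk.pos_tag_sents(...):
# one list of (word, tag) tuples per sentence. Words and tags are illustrative only.
tagged_sents = [
    [("Emma", "NOUN"), ("was", "VERB"), ("clever", "ADJ")],
    [("The", "DET"), ("governess", "NOUN"), ("loved", "VERB"), ("Emma", "NOUN")],
]

def words_with_tag(tagged_sents, pos):
    # The two "for" clauses walk sentences, then tuples, which flattens the nesting.
    # The curly braces build a set, so repeated words collapse into one entry.
    uniques = {word.lower() for sent in tagged_sents for word, tag in sent if tag == pos}
    # sorted() accepts any iterable and always returns a list.
    return sorted(uniques)

nouns = words_with_tag(tagged_sents, "NOUN")
print(nouns)  # ['emma', 'governess']
```

"Emma" appears in both sentences but comes out only once, and the result is a single flat list rather than one list per sentence.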
Zivoni, you are a legend. Your changes produce exactly what I was after. So using the {} brackets is another way of producing a set, without calling set(). I did not know that. Whenever I see curly brackets I think dictionary straight away. So by having the whole expression enclosed in a single pair of brackets, it "flattens" the result into one set instead of a list of lists of tuples. I'll have to look deeper into this, but that's awesome, thank you.
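A small follow-up sketch on the curly brackets, since they serve both sets and dictionaries. The variable names and sample words below are made up for illustration. It also suggests that the flattening comes from the two "for" clauses rather than from the brackets themselves.

```python
# Curly braces build a dict when the items are key: value pairs, and a set otherwise.
empty = {}                    # empty braces are an empty dict, not a set
as_dict = {"emma": "NOUN"}    # key: value pair -> dict
as_set = {"emma", "taylor"}   # bare values -> set

nested = [["b", "a"], ["a", "c"]]

# Two "for" clauses flatten the nesting, whatever bracket type surrounds them.
flat_list = [w for inner in nested for w in inner]
flat_set = {w for inner in nested for w in inner}

# With only one "for" clause the result stays grouped, braces or not.
# (The inner lists are converted to tuples because lists cannot be set members.)
grouped = {tuple(inner) for inner in nested}

print(type(empty).__name__, type(as_set).__name__)  # dict set
print(flat_list)         # ['b', 'a', 'a', 'c']
print(sorted(flat_set))  # ['a', 'b', 'c']
```

To get an empty set, call set(), because {} on its own is always a dictionary.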