Python Forum
Text Processing and NLTK (POS tagging)
#1
Hi, I want to write a function that takes a text and a POS (part of speech) tag as parameters and returns a sorted list of the distinct words that belong to that POS. So 'NOUN' as an argument would return all the nouns in the text. My current output is close to the desired doctest output but not quite right. If you look at my output, you can see that all the required words are there at the start of the lists. I imagine calling sorted() and set() on those elements would help fix this, but I'm not sure where to add those two calls. Does my code look right for what I'm trying to achieve, or am I going about it totally wrong? Cheers.


import nltk

def distinct_words_of_pos(text, pos):
    # Return the sorted list of distinct words with a given part of speech

    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")

    sorted_list = [ [x[0].lower() for x in sorted(el) if x[1] == pos] for el in all_pos]

    return sorted_list
DOCTEST OUTPUT:
Output:
['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']
MY CURRENT OUTPUT:
Output:
 [['[', 'emma', 'jane', 'austen', ']', 'volume', 'emma', 'woodhouse', 'handsome', 'clever', 'home', 'disposition', 'blessings', 'existence', 'years', 'world'], ['daughters', 'father', 'consequence', 'sister', 'marriage', 'mistress', 'house', 'period'], ['mother', 'remembrance', 'caresses', 'place', 'woman', 'governess', 'mother', 'affection'], ['years', 'miss', 'taylor', 'mr.', 'woodhouse', 'family', 'governess', 'friend', 'fond', 'daughters', 'emma'], ['between', 'intimacy', 'sisters'], ['miss', 'taylor', 'office', 'governess', 'mildness', 'o']]
#2
I know almost nothing about NLTK, but I made slight changes to your code:

import nltk

def distinct_words_of_pos(text, pos):
    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")

    uniques = { x[0].lower() for el in all_pos for x in el if x[1]==pos }
    return sorted(uniques)
I guess that your test text is the first 1000 characters of Austen's Emma:
Output:
In [52]: text = nltk.corpus.gutenberg.raw("austen-emma.txt")[:1000]
In [53]: print(distinct_words_of_pos(text, "NOUN")[:10])
['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']
That gives me output that seems identical to your doctest output.

I changed just the last two lines. Your all_pos is a list of lists of tuples: for each sentence there is a list of (word, pos) tuples. You need to "flatten" it before sorting and deduplicating, which can be done by modifying your list comprehension. Using a set comprehension instead of a list comprehension removes the duplicates. After that it's just sorting (which also converts the set to a list).
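To see the difference without installing NLTK, here is a minimal sketch of the two comprehensions side by side. The tagged sentences are made-up data shaped like what pos_tag_sents returns (a list of sentences, each a list of (word, tag) tuples); they are an assumption for illustration, not real tagger output.

```python
# Hypothetical tagged sentences, shaped like the output of nltk.pos_tag_sents:
# one list per sentence, each holding (word, tag) tuples.
all_pos = [
    [("Emma", "NOUN"), ("was", "VERB"), ("clever", "ADJ")],
    [("Her", "PRON"), ("father", "NOUN"), ("loved", "VERB"), ("Emma", "NOUN")],
]
pos = "NOUN"

# Nested comprehension: the inner brackets build one list per sentence,
# so the result is still a list of lists and duplicates survive.
nested = [[word.lower() for word, tag in sent if tag == pos] for sent in all_pos]

# Single comprehension with two `for` clauses: walks sentences, then words.
# The braces collect the results into a set, which drops duplicates.
uniques = {word.lower() for sent in all_pos for word, tag in sent if tag == pos}

# sorted() accepts any iterable and always returns a list.
result = sorted(uniques)

print(nested)   # [['emma'], ['father', 'emma']]
print(result)   # ['emma', 'father']
```

The flattening comes from the second `for` clause, not from the type of bracket: the same two clauses inside square brackets would give a flat list that still contains duplicates.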
#3
Zivoni, you are a legend; your changes produce exactly what I was after. So using the {} brackets is another way of producing a set without calling set(). I did not know that. Whenever I see curly brackets I think dictionary straight away. So by having the whole expression enclosed in a single pair of brackets, it "flattens" the result into one set instead of a list of lists of tuples. I'll have to look deeper into this, but that's awesome, thank you.
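For anyone reading later, a short sketch of how Python tells sets and dictionaries apart when both use curly brackets. It is plain Python with made-up variable names, purely for illustration.

```python
empty_braces = {}                        # empty braces are a dict, not a set
empty_set = set()                        # an empty set needs set()
set_literal = {"a", "b"}                 # bare values: a set
dict_literal = {"a": 1}                  # key: value pairs: a dict
set_comp = {c for c in "abca"}           # set comprehension (duplicates dropped)
dict_comp = {c: ord(c) for c in "abc"}   # dict comprehension (note the colon)

# Flattening is done by the two `for` clauses, whatever the brackets are.
pairs = [[1, 2], [2, 3]]
flat_list = [n for row in pairs for n in row]   # flat, duplicates kept
flat_set = {n for row in pairs for n in row}    # flat, duplicates removed

print(type(empty_braces).__name__, type(set_comp).__name__)  # dict set
print(flat_list)  # [1, 2, 2, 3]
```

So the rule of thumb is: a colon between key and value makes a dict, values on their own make a set, and a lone {} is always a dict.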