Mar-15-2017, 02:12 PM
Hi, I want to write a function that takes a text and a POS (part of speech) tag as parameters and returns a sorted list of the distinct words that carry that tag. So 'NOUN' as an argument would return all the nouns in the text. My current output is fairly close to the desired doctest output, but not quite. If you have a look at my output, you can see that all the required words are there at the start of the lists. I imagine calling set() and sorted() on those elements would help fix this, but I'm not sure where to add those two calls. Does my code look right for what I'm trying to achieve, or am I going about it totally wrong? Cheers.
import nltk

def distinct_words_of_pos(text, pos):
    # Return the sorted list of distinct words with a given part of speech
    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")
    sorted_list = [[x[0].lower() for x in sorted(el) if x[1] == pos] for el in all_pos]
    return sorted_list

DOCTEST OUTPUT:
Output: ['[', ']', 'affection', 'austen', 'between', 'blessings', 'caresses', 'clever', 'consequence', 'daughters']
MY CURRENT OUTPUT:
Output: [['[', 'emma', 'jane', 'austen', ']', 'volume', 'emma', 'woodhouse', 'handsome', 'clever', 'home', 'disposition', 'blessings', 'existence', 'years', 'world'], ['daughters', 'father', 'consequence', 'sister', 'marriage', 'mistress', 'house', 'period'], ['mother', 'remembrance', 'caresses', 'place', 'woman', 'governess', 'mother', 'affection'], ['years', 'miss', 'taylor', 'mr.', 'woodhouse', 'family', 'governess', 'friend', 'fond', 'daughters', 'emma'], ['between', 'intimacy', 'sisters'], ['miss', 'taylor', 'office', 'governess', 'mildness', 'o']]
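For reference, here is a minimal sketch of where set() and sorted() could go. The nested comprehension in the code above builds one list per sentence, so the idea is to collect the matching words from all sentences into a single set first and sort only once at the end. The helper name distinct_words_from_tagged and the hand-tagged sample are made up for illustration; the NLTK calls are the same ones used in the code above, and NLTK is imported inside the wrapper only so the helper can be tried without it.

```python
def distinct_words_from_tagged(tagged_sents, pos):
    """Return the sorted list of distinct lowercased words tagged `pos`.

    `tagged_sents` is a list of sentences, each a list of (word, tag) pairs,
    which is the shape nltk.pos_tag_sents returns.
    """
    # A set comprehension flattens all sentences and drops duplicates.
    words = {
        word.lower()
        for sent in tagged_sents
        for word, tag in sent
        if tag == pos
    }
    # Sorting happens once, on the flat set, not per sentence.
    return sorted(words)


def distinct_words_of_pos(text, pos):
    """Tokenize and tag `text` with NLTK, then reuse the helper above."""
    import nltk  # imported here so the helper works without NLTK installed

    sent_word_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    all_pos = nltk.pos_tag_sents(sent_word_tokens, tagset="universal")
    return distinct_words_from_tagged(all_pos, pos)


# Hand-tagged sample (hypothetical data, not real tagger output).
sample = [
    [("Emma", "NOUN"), ("was", "VERB"), ("clever", "ADJ")],
    [("Her", "PRON"), ("daughters", "NOUN"), ("loved", "VERB"), ("Emma", "NOUN")],
]
print(distinct_words_from_tagged(sample, "NOUN"))  # ['daughters', 'emma']
```

The doctest output above shows exactly ten words, so the doctest may be slicing the result (for example with [:10]); that is only a guess from the output shown.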