Mar-17-2017, 05:17 AM
I'm very new to Python and was looking for a language that could be used for processing large bodies of text. A friend of mine recommended Python along with the NLTK library. As part of the NLTK (Natural Language Toolkit) book, I have an input text consisting of thousands of words ("austen-emma.txt"). I'm trying to write a function that returns the most common "parts of speech (POS) bigram" in the text. How do I count the bigrams so that the return value is a single tuple, namely the most common POS bigram in the text? As you can see in the output below, the first value ("NOUN") is correct, but the second element of the tuple is a count instead of the required full stop (".").
def parts_of_speech_bigram_counter(austingEmmaText):
    """Return the most common "parts of speech" bigram """
    tokens = nltk.word_tokenize(austingEmmaText)
    partsOfSpeech = nltk.pos_tag(tokens, tagset="universal")
    bigrams = list(nltk.bigrams(partsOfSpeech))
    counts = Counter(tag for word, tag in bigrams for tag in tag)
    s = counts.most_common(1)
    return s

So the required output is ('NOUN', '.') but I'm getting [('NOUN', 51)].
Output:
Failed example:
parts_of_speech_bigram_counter(emma[:1500])
Expected:
('NOUN', '.')
Got:
[('NOUN', 51)]
Example of the output when the "most_common()" method is not used.
Output:
Counter({'NOUN': 51, 'ADP': 27, '.': 26, 'VERB': 25, 'DET': 17, 'ADJ': 16, ',': 13, 'PRON': 12, 'of': 12, 'ADV': 11, 'CONJ': 8, 'the': 7, 'had': 7, 'a': 6, 'PRT': 6, 'and': 5, 'her': 5, 'to': 4, 'in': 4, 'Emma': 3, 'NUM': 3, ';': 3, 'very': 3, 'been': 3, 'governess': 3, 'by': 2, 'I': 2, 'Woodhouse': 2, 'with': 2, 'years': 2, 'little': 2, 'was': 2, 'daughters': 2, "'s": 2, 'mother': 2, 'more': 2, 'than': 2, 'an': 2, 'as': 2, 'Miss': 2, 'Taylor': 2, 'Jane': 1, 'Austen': 1, '1816': 1, ']': 1, 'VOLUME': 1, 'CHAPTER': 1, 'handsome': 1, 'clever': 1, 'rich': 1, 'comfortable': 1, 'home': 1, 'happy': 1, 'disposition': 1, 'seemed': 1, 'unite': 1, 'some': 1, 'best': 1, 'blessings': 1, 'existence': 1, 'lived': 1, 'nearly': 1, 'twenty-one': 1, 'world': 1, 'distress': 1, 'or': 1, 'vex': 1, 'She': 1, 'youngest': 1, 'two': 1, 'most': 1, 'affectionate': 1, 'indulgent': 1, 'father': 1, 'consequence': 1, 'sister': 1, 'marriage': 1, 'mistress': 1, 'his': 1, 'house': 1, 'from': 1, 'early': 1, 'period': 1, 'Her': 1, 'died': 1, 'too': 1, 'long': 1, 'ago': 1, 'for': 1, 'have': 1, 'indistinct': 1, 'remembrance': 1, 'caresses': 1, 'place': 1, 'supplied': 1, 'excellent': 1, 'woman': 1, 'who': 1, 'fallen': 1, 'short': 1, 'affection': 1, 'Sixteen': 1, 'Mr.': 1, 'family': 1, 'less': 1, 'friend': 1, 'fond': 1, 'both': 1, 'but': 1, 'particularly': 1, 'Between': 1, '_them_': 1, 'it': 1, 'intimacy': 1, 'sisters': 1, 'Even': 1, 'before': 1, 'ceased': 1, 'hold': 1, 'nominal': 1, 'office': 1, 'mildness': 1, 'o': 1})
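One possible reading of the Counter above: nltk.bigrams(partsOfSpeech) yields pairs of (word, tag) tuples, so "for word, tag in bigrams for tag in tag" ends up counting the individual words and tags of the second token rather than pairs of tags. A minimal sketch of counting tag pairs instead is shown here. The function name most_common_pos_bigram and the hand-tagged sample are made up for illustration, and the sample only stands in for what nltk.pos_tag(tokens, tagset="universal") would return; adjacent pairs are built with zip so the snippet runs without NLTK installed.

```python
from collections import Counter


def most_common_pos_bigram(tagged_tokens):
    """Return the most frequent pair of adjacent POS tags as a tuple.

    tagged_tokens is a list of (word, tag) pairs, the shape that
    nltk.pos_tag returns. Raises IndexError if fewer than two tokens
    are given, because there is then no bigram to report.
    """
    # Keep only the tags; the words do not matter for a POS bigram.
    tags = [tag for _word, tag in tagged_tokens]
    # Pair each tag with the tag that follows it.
    pair_counts = Counter(zip(tags, tags[1:]))
    # most_common(1) gives [((tag1, tag2), count)]; unwrap to the pair.
    return pair_counts.most_common(1)[0][0]


# Hypothetical hand-tagged sample (not real tagger output).
sample = [
    ("Emma", "NOUN"), (".", "."),
    ("She", "PRON"), ("smiled", "VERB"), (".", "."),
    ("Home", "NOUN"), (".", "."),
]
print(most_common_pos_bigram(sample))
```

In the original function, the same idea would probably amount to extracting the tags from partsOfSpeech first, passing that list of tags to nltk.bigrams, and returning counts.most_common(1)[0][0] rather than the list that most_common(1) produces.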