Parts of speech bigram counter

Parts of speech bigram counter - Casper - Mar-17-2017

I'm very new to Python and was looking for a language that could be used for processing large bodies of text. A friend of mine recommended Python along with the NLTK library. As part of the NLTK (Natural Language Toolkit) book, I have an input text consisting of thousands of words ("austen-emma.txt"). I'm trying to write a function that returns the most common parts-of-speech (POS) bigram in the text. How do I count the bigrams so that the return value is a single tuple, the most common POS bigram in the text? As you can see in the output, I'm getting the first value "NOUN" correct, but the second element of the tuple is a count instead of the required full stop ".".

import nltk
from collections import Counter

def parts_of_speech_bigram_counter(austingEmmaText):
    """Return the most common "parts of speech" bigram."""
    tokens = nltk.word_tokenize(austingEmmaText)
    partsOfSpeech = nltk.pos_tag(tokens, tagset="universal")
    bigrams = list(nltk.bigrams(partsOfSpeech))
    counts = Counter(tag for word, tag in bigrams for tag in tag)
    s = counts.most_common(1)
    return s

So the required output is ('NOUN', '.') but I'm getting [('NOUN', 51)].

Output:
Failed example:
    parts_of_speech_bigram_counter(emma[:1500])
Expected:
    ('NOUN', '.')
Got:
    [('NOUN', 51)]

Here is the output of the counter without calling the most_common() method.

Output:
   Counter({'NOUN': 51, 'ADP': 27, '.': 26, 'VERB': 25, 'DET': 17, 'ADJ': 16, ',': 13, 'PRON': 12, 'of': 12, 'ADV': 11, 'CONJ': 8, 'the': 7, 'had': 7, 'a': 6, 'PRT': 6, 'and': 5, 'her': 5, 'to': 4, 'in': 4, 'Emma': 3, 'NUM': 3, ';': 3, 'very': 3, 'been': 3, 'governess': 3, 'by': 2, 'I': 2, 'Woodhouse': 2, 'with': 2, 'years': 2, 'little': 2, 'was': 2, 'daughters': 2, "'s": 2, 'mother': 2, 'more': 2, 'than': 2, 'an': 2, 'as': 2, 'Miss': 2, 'Taylor': 2, 'Jane': 1, 'Austen': 1, '1816': 1, ']': 1, 'VOLUME': 1, 'CHAPTER': 1, 'handsome': 1, 'clever': 1, 'rich': 1, 'comfortable': 1, 'home': 1, 'happy': 1, 'disposition': 1, 'seemed': 1, 'unite': 1, 'some': 1, 'best': 1, 'blessings': 1, 'existence': 1, 'lived': 1, 'nearly': 1, 'twenty-one': 1, 'world': 1, 'distress': 1, 'or': 1, 'vex': 1, 'She': 1, 'youngest': 1, 'two': 1, 'most': 1, 'affectionate': 1, 'indulgent': 1, 'father': 1, 'consequence': 1, 'sister': 1, 'marriage': 1, 'mistress': 1, 'his': 1, 'house': 1, 'from': 1, 'early': 1, 'period': 1, 'Her': 1, 'died': 1, 'too': 1, 'long': 1, 'ago': 1, 'for': 1, 'have': 1, 'indistinct': 1, 'remembrance': 1, 'caresses': 1, 'place': 1, 'supplied': 1, 'excellent': 1, 'woman': 1, 'who': 1, 'fallen': 1, 'short': 1, 'affection': 1, 'Sixteen': 1, 'Mr.': 1, 'family': 1, 'less': 1, 'friend': 1, 'fond': 1, 'both': 1, 'but': 1, 'particularly': 1, 'Between': 1, '_them_': 1, 'it': 1, 'intimacy': 1, 'sisters': 1, 'Even': 1, 'before': 1, 'ceased': 1, 'hold': 1, 'nominal': 1, 'office': 1, 'mildness': 1, 'o': 1})

RE: Parts of speech bigram counter - Larz60+ - Mar-17-2017

As an aid to testing the code:
The text file is available here: https://github.com/fbkarsdorp/python-course/blob/master/data/austen-emma.txt

RE: Parts of speech bigram counter - zivoni - Mar-17-2017

Your bigrams list is a list of pairs of (word, pos) tuples, in the form ((word1, pos1), (word2, pos2)), and you need to convert each pair to (pos1, pos2). Your counter counts the individual word2 and pos2 elements, not the (pos1, pos2) tuples. Here is a slight modification of your code, in which bigram_pos extracts the (pos1, pos2) tuples from the bigrams list.

import nltk
from collections import Counter

def pos_bigram_counter(text):
    tokens = nltk.word_tokenize(text)
    pos = nltk.pos_tag(tokens, tagset="universal")
    bigrams = list(nltk.bigrams(pos))

    bigram_pos = ((pos1, pos2) for (w1, pos1), (w2, pos2) in bigrams)
    return Counter(bigram_pos).most_common(1)[0][0]

gives

Output:
In [3]: emma = nltk.corpus.gutenberg.raw('austen-emma.txt')

In [4]: pos_bigram_counter(emma[:1500])
Out[4]: ('NOUN', '.')

for both the first 1500 and the first 1000 characters (it seems that you used [:1000] for your last output).
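
To see the difference between the two comprehensions without running the tagger, here is a minimal sketch on a small hand-made list. The (word, tag) pairs are made up for illustration and are not real pos_tag output, and zip stands in for nltk.bigrams so that the snippet needs nothing beyond the standard library.

```python
from collections import Counter

# Hand-made (word, tag) pairs; illustrative only, not real pos_tag output.
tagged = [("Emma", "NOUN"), ("was", "VERB"), ("happy", "ADJ"), (".", "."),
          ("Emma", "NOUN"), ("smiled", "VERB"), (".", ".")]

# Same shape as list(nltk.bigrams(tagged)): pairs of (word, tag) tuples.
bigrams = list(zip(tagged, tagged[1:]))

# Original comprehension: "word" and "tag" are each bound to a whole
# (word, tag) tuple, and the inner loop then walks over the two elements
# of the second tuple, so words and tags are counted individually.
flat_counts = Counter(tag for word, tag in bigrams for tag in tag)

# Nested unpacking keeps exactly the two tags of each bigram.
pair_counts = Counter((pos1, pos2) for (w1, pos1), (w2, pos2) in bigrams)

print(flat_counts.most_common(1))        # a single string with its count
print(pair_counts.most_common(1)[0][0])  # a (pos1, pos2) tuple
```

The first counter mixes words and tags in one pool, which is why the earlier output contains entries such as 'of' and 'the' next to 'NOUN'.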

RE: Parts of speech bigram counter - Casper - Mar-18-2017

Plus 1 rep. Thanks, I understand how to do it now.

Sorry zivoni, could you explain a little more about the [0][0] at the end of the return statement?

RE: Parts of speech bigram counter - zivoni - Mar-18-2017

Counter's .most_common(n) method returns a list of the n most common elements together with their counts. Even if you ask for just one element with most_common(1), it still returns a list, containing a single tuple.

So

Counter(bigram_pos).most_common(1)

returns for your data

Output:
[(('NOUN', '.'), 22)]

It is a list with one tuple, and that tuple contains the element (your desired tuple) and its count. The first [0] selects the first tuple from the list, (('NOUN', '.'), 22), which is the most common element together with its count. You are interested only in the ('NOUN', '.') tuple, so you need to use [0] a second time to select it.
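
The two indexing steps can be tried on a small Counter built by hand. This is only a sketch: the count of 22 for ('NOUN', '.') is taken from the output above, while the other two entries are invented to give most_common() something to rank.

```python
from collections import Counter

# Only the ('NOUN', '.') count comes from the thread; the rest is made up.
counts = Counter({("NOUN", "."): 22, ("DET", "NOUN"): 15, ("ADP", "DET"): 9})

top = counts.most_common(1)   # list holding one (element, count) tuple
pair_and_count = top[0]       # first [0]: the (element, count) tuple
pair = top[0][0]              # second [0]: the element itself

# Unpacking the one-item list is an equivalent alternative to [0][0].
[(best_pair, best_count)] = counts.most_common(1)

print(top)
print(pair_and_count)
print(pair)
```

Unpacking also gives you the count under its own name, which can be handy if you want to report it as well.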