Python Forum
Parts of speech bigram counter
#1
I'm very new to Python and was looking for a language that could be used for processing large bodies of text. A friend of mine recommended Python along with the NLTK library. As part of working through the NLTK (Natural Language Toolkit) book, I have an input text consisting of thousands of words ("austen-emma.txt"). I'm trying to write a function that returns the most common part-of-speech (POS) bigram in the text. How do I count the bigrams so that the return value is a single tuple, the most common POS bigram in the text? As you can see in the output, I'm getting the first value, "NOUN", correct, but I get a count as the second element of the tuple instead of the required full stop ('.').

def parts_of_speech_bigram_counter(austingEmmaText):

    """Return the most common "parts of speech" bigram    """   

    tokens = nltk.word_tokenize(austingEmmaText)

    partsOfSpeech = nltk.pos_tag(tokens, tagset="universal")

    bigrams = list(nltk.bigrams(partsOfSpeech))        

    counts = Counter(tag for word, tag in bigrams for tag in tag)

    s = counts.most_common(1)
    
    return s
So the required output is ('NOUN', '.') but I'm getting [('NOUN', 51)].  
Output:
Failed example:
    parts_of_speech_bigram_counter(emma[:1500])
Expected:
    ('NOUN', '.')
Got:
    [('NOUN', 51)]
Here is the output of the counter itself, without calling the most_common() method:
Output:
   Counter({'NOUN': 51, 'ADP': 27, '.': 26, 'VERB': 25, 'DET': 17, 'ADJ': 16, ',': 13, 'PRON': 12, 'of': 12, 'ADV': 11, 'CONJ': 8, 'the': 7, 'had': 7, 'a': 6, 'PRT': 6, 'and': 5, 'her': 5, 'to': 4, 'in': 4, 'Emma': 3, 'NUM': 3, ';': 3, 'very': 3, 'been': 3, 'governess': 3, 'by': 2, 'I': 2, 'Woodhouse': 2, 'with': 2, 'years': 2, 'little': 2, 'was': 2, 'daughters': 2, "'s": 2, 'mother': 2, 'more': 2, 'than': 2, 'an': 2, 'as': 2, 'Miss': 2, 'Taylor': 2, 'Jane': 1, 'Austen': 1, '1816': 1, ']': 1, 'VOLUME': 1, 'CHAPTER': 1, 'handsome': 1, 'clever': 1, 'rich': 1, 'comfortable': 1, 'home': 1, 'happy': 1, 'disposition': 1, 'seemed': 1, 'unite': 1, 'some': 1, 'best': 1, 'blessings': 1, 'existence': 1, 'lived': 1, 'nearly': 1, 'twenty-one': 1, 'world': 1, 'distress': 1, 'or': 1, 'vex': 1, 'She': 1, 'youngest': 1, 'two': 1, 'most': 1, 'affectionate': 1, 'indulgent': 1, 'father': 1, 'consequence': 1, 'sister': 1, 'marriage': 1, 'mistress': 1, 'his': 1, 'house': 1, 'from': 1, 'early': 1, 'period': 1, 'Her': 1, 'died': 1, 'too': 1, 'long': 1, 'ago': 1, 'for': 1, 'have': 1, 'indistinct': 1, 'remembrance': 1, 'caresses': 1, 'place': 1, 'supplied': 1, 'excellent': 1, 'woman': 1, 'who': 1, 'fallen': 1, 'short': 1, 'affection': 1, 'Sixteen': 1, 'Mr.': 1, 'family': 1, 'less': 1, 'friend': 1, 'fond': 1, 'both': 1, 'but': 1, 'particularly': 1, 'Between': 1, '_them_': 1, 'it': 1, 'intimacy': 1, 'sisters': 1, 'Even': 1, 'before': 1, 'ceased': 1, 'hold': 1, 'nominal': 1, 'office': 1, 'mildness': 1, 'o': 1})
#2
As an aid to testing the code:
The text file is available here: https://github.com/fbkarsdorp/python-cou...n-emma.txt
#3
Your bigrams list is a list of pairs of (word, pos) tuples, in the form ((word1, pos1), (word2, pos2)), and you need to convert each of these to (pos1, pos2). Your counter is counting the individual word2 and pos2 elements, not the tuples (pos1, pos2). Here is a slight modification of your code, with bigram_pos extracting the (pos1, pos2) tuples from the bigrams list:

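import nltk
from collections import Counter

def pos_bigram_counter(text):
    """Return the most common part-of-speech bigram in text."""
    tokens = nltk.word_tokenize(text)
    pos = nltk.pos_tag(tokens, tagset="universal")
    # Each bigram has the form ((word1, pos1), (word2, pos2)).
    bigrams = list(nltk.bigrams(pos))

    # Keep only the two tags from each bigram.
    bigram_pos = ((pos1, pos2) for (w1, pos1), (w2, pos2) in bigrams)
    # most_common(1) returns [((pos1, pos2), count)]; [0][0] selects the tag pair.
    return Counter(bigram_pos).most_common(1)[0][0]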
gives
Output:
In [3]: emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
In [4]: pos_bigram_counter(emma[:1500])
Out[4]: ('NOUN', '.')
for both the first 1500 and the first 1000 characters (it seems that you used [:1000] for your last output).
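
To see why the original comprehension counted words and single tags, here is a small sketch that runs without NLTK. The tagged list is hand-made and only stands in for nltk.pos_tag output (the tags are illustrative, not real tagger results), and zip(tagged, tagged[1:]) is used as a stand-in that yields the same adjacent pairs as nltk.bigrams.

```python
from collections import Counter

# Hypothetical (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [("Emma", "NOUN"), ("was", "VERB"), ("happy", "ADJ"), (".", ".")]

# Adjacent pairs, each of the form ((word1, pos1), (word2, pos2)).
bigrams = list(zip(tagged, tagged[1:]))

# Original comprehension: `word` is bound to the first (word, tag) pair,
# `tag` to the second pair, and the inner loop then walks over that
# second pair, yielding word2 and pos2 as separate items.
wrong = Counter(tag for word, tag in bigrams for tag in tag)

# Fixed version: unpack both pairs and keep only the two tags.
right = Counter((pos1, pos2) for (w1, pos1), (w2, pos2) in bigrams)

print(wrong)
print(right)
```

The first counter holds plain strings such as 'was' and 'VERB', which matches the mix of words and tags in the question's Counter output. The second holds (pos1, pos2) tuples, which is what most_common needs in order to return a tag pair.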
#4
Plus 1 rep. Thanks, I understand how to do it now.

Sorry, zivoni, could you explain a little more about the [0][0] at the end of the return statement?
#5
Counter's .most_common(n) method returns a list of the n most common elements together with their counts. Even if you ask for just the single most common element with most_common(1), it still returns a list, containing one tuple.

So
Counter(bigram_pos).most_common(1)
returns for your data
Output:
[(('NOUN', '.'), 22)]
It is a list with one tuple, and that tuple contains the element (your desired tuple) and its count. The first [0] selects the first tuple from the list, (('NOUN', '.'), 22), which holds the most common element and its count. You are interested only in the ('NOUN', '.') tuple, so you need to apply [0] a second time to select it.
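
The two indexing steps can be tried on a tiny counter. This is only a sketch: the bigram list below is made up for illustration and is not taken from the Emma text.

```python
from collections import Counter

# Hypothetical tag bigrams; ('NOUN', '.') occurs twice, the others once.
counts = Counter([("NOUN", "."), ("DET", "NOUN"), ("NOUN", "."), ("ADJ", "NOUN")])

top = counts.most_common(1)   # a list holding one (element, count) tuple
pair_and_count = top[0]       # first [0]: the (element, count) tuple
bigram = top[0][0]            # second [0]: the element alone

# Equivalent alternative: unpack the one-item list directly.
[(best, n)] = counts.most_common(1)

print(top)
print(pair_and_count)
print(bigram)
```

Unpacking with [(best, n)] gives the same bigram as [0][0] and also keeps the count, which may be handy if you want to report both.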