Python Forum

[NLTK] How to calculate lexical diversity scores in Jupyter?
Hi Python/NLTK mentors,

Here is the task that I'm trying to implement in Jupyter Notebook:

Compare the lexical diversity scores for all 15 text categories in the Brown Corpus.

Which genre is more lexically diverse?

Could you please check whether my code below is the right way to calculate lexical diversity scores? Thank you in advance.

from nltk.corpus import brown

print("Genre Lexical diversity")
for cat in brown.categories():
    # All word tokens in this category
    words_in_cat = brown.words(categories=cat)
    # Tokens divided by distinct word types
    print(cat, "{:3g}".format(len(words_in_cat) / len(set(words_in_cat))))
print(" ")
[Image: 02_zpsxawm7nuk.png]
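For anyone who wants to check the loop without downloading the corpus, here is a minimal sketch of the same calculation on made-up data. The names diversity_by_category and toy_corpus are my own (hypothetical), and the ratio is tokens divided by types, as in the code above.

```python
def lexical_diversity(words):
    """Average number of times each distinct word is used (tokens / types)."""
    words = list(words)
    return len(words) / len(set(words))


def diversity_by_category(words_by_category):
    """Map each category name to the lexical diversity of its word list."""
    return {cat: lexical_diversity(words) for cat, words in words_by_category.items()}


# Toy stand-in for the corpus (hypothetical data, not taken from Brown).
toy_corpus = {
    "news": "the cat sat on the mat the end".split(),
    "humor": "a b c d".split(),
}

scores = diversity_by_category(toy_corpus)
for cat, score in sorted(scores.items()):
    print(cat, "{:.3g}".format(score))

# With NLTK and the Brown Corpus installed, the same helper could be fed with:
#   from nltk.corpus import brown
#   diversity_by_category({c: brown.words(categories=c) for c in brown.categories()})
```

The point of the helper is that it always receives the list of words for a category, never the category name itself.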
This is modified (from O'Reilly 'Natural Language Processing with Python' page 9)
import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total
    
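for cat in nltk.corpus.brown.categories():
    # Note: cat is the category name (a str), so this scores the letters of the
    # name, not the words in that category.
    print(lexical_diversity(cat))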
Results:
Output:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
1.125
2.0
1.125
1.1666666666666667
1.25
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1428571428571428
1.1666666666666667
1.0
1.6666666666666667
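A note on the numbers above: they can be reproduced without loading the corpus at all, which suggests the function is being handed each category name (a string) rather than the words in that category. A minimal sketch, using only a few of the category names from the loop:

```python
def lexical_diversity(text):
    return len(text) / len(set(text))


# Category names only; no corpus is loaded here.
names = ["adventure", "belles_lettres", "humor", "science_fiction"]
name_scores = {name: lexical_diversity(name) for name in names}
for name, score in name_scores.items():
    print(name, score)
```

"adventure" has nine letters, eight of them distinct, so the function returns 9 / 8 = 1.125, which matches the first value printed above. To score the text itself, the loop would need to pass something like brown.words(categories=cat) instead of cat.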
(Aug-31-2018, 02:40 PM)Larz60+ Wrote: This is modified (from O'Reilly 'Natural Language Processing with Python' page 9)

Thank you, Larz60+! I changed the code a little to get the same number of decimal places as Table 1.1 here: https://www.nltk.org/book/ch01.html#tab-brown-types

I also noticed that the output of the code (see photo below) differs from Table 1.1 at the link above.

Do you have any idea why it's different? Thank you.

import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

print('Genre Lexical diversity:')
def lexical_diversity(text):
    return len(text) / len(set(text))

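for cat in nltk.corpus.brown.categories():
    # cat is still the category name string here; dividing by 10 only shifts
    # the decimal point of the result.
    print(cat, (lexical_diversity(cat) / 10))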
The output from the code above:

Output:
Genre Lexical diversity:
adventure 0.1125
belles_lettres 0.2
editorial 0.1125
fiction 0.11666666666666667
government 0.125
hobbies 0.11666666666666667
humor 0.1
learned 0.11666666666666667
lore 0.1
mystery 0.11666666666666667
news 0.1
religion 0.11428571428571428
reviews 0.11666666666666667
romance 0.1
science_fiction 0.16666666666666669
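Dividing by 10 only moves the decimal point; it does not change how many digits are printed. A rough sketch of two things that may matter when comparing against a published table: the direction of the ratio, and formatting the result to a fixed number of decimals. The function names and the sample sentence are hypothetical. As far as I can tell the online edition of the book defines lexical diversity as types divided by tokens, but that is an assumption worth checking against the linked page.

```python
def tokens_per_type(words):
    """How many times each distinct word is used on average."""
    return len(words) / len(set(words))


def types_per_token(words):
    """Share of the text made up of distinct words (the inverse ratio)."""
    return len(set(words)) / len(words)


# Made-up sample text: 11 tokens, 8 distinct types.
sample = "the cat sat on the mat and the dog sat too".split()

# Format to three decimal places instead of dividing the value.
print("tokens/types:", "{:.3f}".format(tokens_per_type(sample)))
print("types/tokens:", "{:.3f}".format(types_per_token(sample)))
```

The two ratios are reciprocals of each other, so a table built with one definition will never line up with code that uses the other, whatever the formatting.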
===============================================================================================

Table 1.1 Brown Corpus Lexical Diversity for 6 genres

[Image: 03_zpsobmish2w.png]

[Image: 04_zps6v4gnm7o.png]
My Brown Corpus hasn't been updated in a long time, so I wouldn't worry about the differences.
(Aug-31-2018, 04:43 PM)Larz60+ Wrote: My Brown Corpus hasn't been updated in a long time, so I wouldn't worry about the differences.

Awesome, thank you very much again, Larz60+!