[NLTK] How to calculate lexical diversity scores in Jupyter?

vanicci · Aug-31-2018, 01:23 PM

Hi Python/NLTK mentors,

Here is the task that I'm trying to implement in Jupyter Notebook:

Compare the lexical diversity scores for all 15 text categories in the Brown Corpus.

Which genre is more lexically diverse?

Could you check whether the code below is the right way to calculate lexical diversity scores? Thank you in advance.

from nltk.corpus import brown

# Lexical diversity here = tokens per distinct word type,
# so a higher value means each word is reused more often.
print("Genre Lexical diversity")
for cat in brown.categories():
    words_in_cat = brown.words(categories=cat)
    print(cat, "{:3g}".format(len(words_in_cat) / len(set(words_in_cat))))
print(" ")

[Image: 02_zpsxawm7nuk.png]

**Larz60+** · Aug-31-2018, 02:40 PM

This is modified from the O'Reilly book 'Natural Language Processing with Python', page 9:

import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total
    
for cat in nltk.corpus.brown.categories():
    print(lexical_diversity(cat))

Results:

Output:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
1.125
2.0
1.125
1.1666666666666667
1.25
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1428571428571428
1.1666666666666667
1.0
1.6666666666666667
Larz60p@linux-nnem: NltkPlay:$
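
The values printed above look like what `lexical_diversity` returns when it is given each category name (a string such as 'adventure') rather than the words in that category. For a string, `len(text)` counts characters and `set(text)` collects distinct characters: 'adventure' has 9 characters, 8 of them distinct, and 9 / 8 = 1.125. A small check, using only names and values copied from this thread and no NLTK at all:

```python
import math


def lexical_diversity(text):
    return len(text) / len(set(text))


# Category names and values copied from the output in this thread.
printed = {
    "adventure": 1.125,
    "belles_lettres": 2.0,
    "government": 1.25,
    "religion": 1.1428571428571428,
    "science_fiction": 1.6666666666666667,
}

for name, value in printed.items():
    # A str is a sequence of characters, so this measures letter repetition
    # in the name itself, not word repetition in the genre.
    char_ratio = lexical_diversity(name)
    print(name, len(name), len(set(name)), math.isclose(char_ratio, value))
```

Passing `nltk.corpus.brown.words(categories=cat)` instead of `cat`, as the first post in this thread does, gives the function word tokens to count.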

vanicci · Aug-31-2018, 03:48 PM

Thank you, Larz60+! I changed the code a little to get values in the same decimal range as Table 1.1 here: https://www.nltk.org/book/ch01.html#tab-brown-types

I also noticed that the output of the code (see photo below) is different from Table 1.1 at the link above.

Do you have any idea why it's different? Thank you

import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

print('Genre Lexical diversity:')
def lexical_diversity(text):
    return len(text) / len(set(text))

for cat in nltk.corpus.brown.categories():
    print(cat, (lexical_diversity(cat) / 10))

The output from the code above:

Output:
Genre Lexical diversity:
adventure 0.1125
belles_lettres 0.2
editorial 0.1125
fiction 0.11666666666666667
government 0.125
hobbies 0.11666666666666667
humor 0.1
learned 0.11666666666666667
lore 0.1
mystery 0.11666666666666667
news 0.1
religion 0.11428571428571428
reviews 0.11666666666666667
romance 0.1
science_fiction 0.16666666666666669

===============================================================================================

Table 1.1 Brown Corpus Lexical Diversity for 6 genres

[Image: 03_zpsobmish2w.png]

[Image: 04_zps6v4gnm7o.png]

**Larz60+** · Aug-31-2018, 04:43 PM

My copy of the Brown Corpus hasn't been updated in a long time, so I wouldn't worry about the differences.
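
An out-of-date corpus copy would not normally explain a gap this large. Two other causes look more likely, offered as a sketch rather than a verified diagnosis. First, the loop earlier in the thread scores the category name instead of the category's words. Second, the table in the online chapter appears to report distinct types divided by tokens (a value between 0 and 1) rather than tokens divided by types, so dividing by 10 only shifts the decimal point. The names `type_token_ratio` and `brown_scores` below are made up for illustration; `brown.categories()` and `brown.words(categories=...)` are the NLTK calls already used in this thread, and the corpus is assumed to have been fetched once with `nltk.download('brown')`.

```python
def lexical_diversity(tokens):
    """Tokens per distinct type: average number of times each word is used."""
    return len(tokens) / len(set(tokens))


def type_token_ratio(tokens):
    """Distinct types per token: a value in (0, 1], higher means more varied."""
    return len(set(tokens)) / len(tokens)


def brown_scores():
    """Return {genre: (tokens, types, type_token_ratio)} for the Brown Corpus."""
    # Imported lazily so the two helpers above work without NLTK installed.
    from nltk.corpus import brown

    scores = {}
    for cat in brown.categories():
        words = brown.words(categories=cat)  # the words, not the name string
        scores[cat] = (len(words), len(set(words)), type_token_ratio(words))
    return scores


if __name__ == "__main__":
    try:
        scores = brown_scores()
    except (ImportError, LookupError) as exc:
        # NLTK raises LookupError when the corpus data has not been downloaded.
        print("NLTK or the Brown Corpus is unavailable:", exc)
    else:
        for genre, (tokens, types, ratio) in sorted(scores.items()):
            print(f"{genre:16} {tokens:8} {types:7} {ratio:.3f}")
```

Read this way, the genre with the highest `type_token_ratio` is the most lexically diverse. With tokens divided by types, as in `lexical_diversity`, it is the genre with the lowest value.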

vanicci · Sep-01-2018, 09:43 AM

Awesome, thank you very much again, Larz60+!
