Python Forum
[NLTK] How to calculate lexical diversity scores in Jupyter?
#1
Hi Python/NLTK mentors,

Here is the task that I'm trying to implement in Jupyter Notebook:

Compare the lexical diversity scores for all 15 text categories in the Brown Corpus.

Which genre is more lexically diverse?

Could you please check whether the code below is the right way to calculate lexical diversity scores? Thank you in advance.

from nltk.corpus import brown

print("Genre Lexical diversity")
for cat in brown.categories():
    # All word tokens in this genre
    words_in_cat = brown.words(categories=cat)
    # Tokens per distinct word type
    print(cat, "{:3g}".format(len(words_in_cat) / len(set(words_in_cat))))
print(" ")
[Image: 02_zpsxawm7nuk.png]
Blockchain Visionary & Aspiring Encipher/Software Developer
me = {'Python Learner' : 'Beginner\'s Level'}
http://bit.ly/JoinMeOnYouTube
#2
This is modified from O'Reilly's 'Natural Language Processing with Python', page 9:
import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total
    
for cat in nltk.corpus.brown.categories():
    print(lexical_diversity(cat))
Results:
Output:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
1.125
2.0
1.125
1.1666666666666667
1.25
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1428571428571428
1.1666666666666667
1.0
1.6666666666666667
Larz60p@linux-nnem: NltkPlay:$
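
Note on the numbers above: each one matches the characters of the category name rather than the words in that category. brown.categories() yields names such as 'adventure', and lexical_diversity(cat) receives that string, so len(text) counts its 9 characters and set(text) its 8 distinct ones, giving 1.125. A minimal sketch of the difference, using a toy word list (an assumption for illustration, not Brown Corpus data):

```python
def lexical_diversity(text):
    # Tokens divided by distinct tokens; works on any sequence.
    return len(text) / len(set(text))

# A category name is a plain string, so its "tokens" are characters.
name_score = lexical_diversity("adventure")  # 9 characters, 8 distinct

# A list of words is what the measure is meant for.
# Toy sample for illustration only, not Brown Corpus data.
sample_words = ["the", "cat", "sat", "on", "the", "mat"]
word_score = lexical_diversity(sample_words)  # 6 tokens, 5 distinct

print(name_score)  # 1.125
print(word_score)  # 1.2
```

With NLTK, the loop would pass the words instead of the name, for example lexical_diversity(brown.words(categories=cat)), as in the first post.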
#3
(Aug-31-2018, 02:40 PM)Larz60+ Wrote: This is modified from O'Reilly's 'Natural Language Processing with Python', page 9:

Thank you, Larz60+! I changed the code a little to get the same number of decimal places as Table 1.1 here: https://www.nltk.org/book/ch01.html#tab-brown-types

I also noticed that the output of the code (see photo below) differs from Table 1.1 at the link above.

Do you have any idea why it's different? Thank you.


import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

print('Genre Lexical diversity:')
def lexical_diversity(text):
    return len(text) / len(set(text))

for cat in nltk.corpus.brown.categories():
    print(cat, (lexical_diversity(cat) / 10))
The output from the code above:

Output:
Genre Lexical diversity:
adventure 0.1125
belles_lettres 0.2
editorial 0.1125
fiction 0.11666666666666667
government 0.125
hobbies 0.11666666666666667
humor 0.1
learned 0.11666666666666667
lore 0.1
mystery 0.11666666666666667
news 0.1
religion 0.11428571428571428
reviews 0.11666666666666667
romance 0.1
science_fiction 0.16666666666666669
===============================================================================================

Table 1.1 Brown Corpus Lexical Diversity for 6 genres

[Image: 03_zpsobmish2w.png]

[Image: 04_zps6v4gnm7o.png]
#4
My copy of the Brown Corpus hasn't been updated in a long time, so I wouldn't worry about the differences.
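
For what it's worth, the differences can also be explained without any corpus update. Dividing by 10 only moves the decimal point, whereas the scores in the linked Table 1.1 appear to be types divided by tokens (an assumption based on the linked chapter), which is the reciprocal of tokens / types. A rough sketch of such a table follows; diversity_row is a hypothetical helper, and the toy word lists stand in for brown.words(categories=cat):

```python
def diversity_row(genre, words):
    """Return (genre, tokens, types, types / tokens) for one list of words."""
    tokens = len(words)
    types = len(set(words))
    return genre, tokens, types, types / tokens

# Toy word lists, NOT Brown Corpus data; with NLTK these would come from
# brown.words(categories=cat) for each cat in brown.categories().
samples = {
    "toy_a": ["a", "b", "a", "c"],
    "toy_b": ["x", "x", "x", "y", "z"],
}

rows = [diversity_row(genre, words) for genre, words in samples.items()]

# Print a small table with three decimal places for the score.
print("{:<8}{:>8}{:>8}{:>20}".format("Genre", "Tokens", "Types", "Lexical diversity"))
for genre, tokens, types, score in rows:
    print("{:<8}{:>8}{:>8}{:>20.3f}".format(genre, tokens, types, score))

# Under this definition, the highest score marks the most diverse genre.
most_diverse = max(rows, key=lambda row: row[3])
print("Most diverse:", most_diverse[0])
```

Under the types / tokens definition a higher score means more varied vocabulary; under tokens / types the ordering is reversed, so it is worth stating which one a table uses.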
#5
(Aug-31-2018, 04:43 PM)Larz60+ Wrote: my brown corpus hasn't been updated in a long time, so I wouldn't worry about the differences.

Awesome, thank you very much again, Larz60+!