Hi Python/NLTK mentors,
Here is the task I'm trying to implement in a Jupyter Notebook:
Compare the lexical diversity scores for all 15 text categories in the Brown Corpus.
Which genre is more lexically diverse?
Could you please check whether the code below is the right way to calculate lexical diversity scores? Thank you in advance.
import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

print("Genre Lexical diversity")
for cat in nltk.corpus.brown.categories():
    # All word tokens in this category
    words_in_cat = nltk.corpus.brown.words(categories=cat)
    # Score = total tokens / distinct types
    print(cat, "{:3g}".format(float(len(words_in_cat)) / len(set(words_in_cat))))
print(" ")
[Image: 02_zpsxawm7nuk.png]
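For the comparison itself, here is a minimal sketch of one way to rank the genres. It keeps the same ratio as the loop above (tokens divided by distinct types), so a lower score means fewer repeats per word and therefore more diversity. The helper names `rank_by_diversity` and `brown_by_genre` and the toy data are made up for illustration; `brown_by_genre` assumes NLTK and the Brown Corpus data are installed.

```python
def lexical_diversity(tokens):
    """Total tokens divided by distinct types (same ratio as the loop above)."""
    tokens = list(tokens)
    return len(tokens) / len(set(tokens))


def rank_by_diversity(corpus_by_genre):
    """Return (genre, score) pairs, lowest tokens-per-type ratio first."""
    scores = {genre: lexical_diversity(tokens)
              for genre, tokens in corpus_by_genre.items()}
    return sorted(scores.items(), key=lambda item: item[1])


def brown_by_genre():
    """Map each Brown category to its words (assumes nltk and the 'brown' data)."""
    # Imported lazily so the helpers above can be tried without NLTK installed.
    from nltk.corpus import brown
    return {cat: brown.words(categories=cat) for cat in brown.categories()}


# Toy data (hypothetical, not taken from the Brown Corpus).
toy = {
    "repetitive": "the cat and the cat and the cat".split(),  # 8 tokens, 3 types
    "varied": "a quick brown fox jumps over lazy dogs".split(),  # 8 tokens, 8 types
}
ranking = rank_by_diversity(toy)
print(ranking)
```

With the corpus available, `rank_by_diversity(brown_by_genre())` should list all 15 categories, with the most diverse genre first under this definition.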
This is modified (from O'Reilly 'Natural Language Processing with Python', page 9):
import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

# Note: cat is the category name (a string), so this call scores
# the letters of the name, not the words in that category.
for cat in nltk.corpus.brown.categories():
    print(lexical_diversity(cat))
results:
Output:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
1.125
2.0
1.125
1.1666666666666667
1.25
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1666666666666667
1.0
1.1428571428571428
1.1666666666666667
1.0
1.6666666666666667
(Aug-31-2018, 02:40 PM) Larz60+ wrote: This is modified (from O'Reilly 'Natural Language Processing with Python' page 9)
Thank you, Larz60+! I changed the code a little to get the same number of decimal places as Table 1.1 here:
https://www.nltk.org/book/ch01.html#tab-brown-types
I also noticed that the output of the code (see the screenshots below) differs from Table 1.1 at the link above.
Do you have any idea why it's different? Thank you.
import nltk
import nltk.corpus
from nltk.book import *
from nltk.corpus import brown

print('Genre Lexical diversity:')

def lexical_diversity(text):
    return len(text) / len(set(text))

# Note: cat is still the category name (a string), and dividing
# by 10 only shifts the decimal point of the result.
for cat in nltk.corpus.brown.categories():
    print(cat, (lexical_diversity(cat) / 10))
The output from the code above.
Output:
Genre Lexical diversity:
adventure 0.1125
belles_lettres 0.2
editorial 0.1125
fiction 0.11666666666666667
government 0.125
hobbies 0.11666666666666667
humor 0.1
learned 0.11666666666666667
lore 0.1
mystery 0.11666666666666667
news 0.1
religion 0.11428571428571428
reviews 0.11666666666666667
romance 0.1
science_fiction 0.16666666666666669
===============================================================================================
Table 1.1 Brown Corpus Lexical Diversity for 6 genres
[Image: 03_zpsobmish2w.png]
[Image: 04_zps6v4gnm7o.png]
My Brown Corpus hasn't been updated in a long time, so I wouldn't worry about the differences.
(Aug-31-2018, 04:43 PM) Larz60+ wrote: My Brown Corpus hasn't been updated in a long time, so I wouldn't worry about the differences.
Awesome, thank you very much again, Larz60+!
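A possible explanation for the mismatch with Table 1.1, other than the corpus version: in the modified loop, `cat` is the category name (a string such as 'adventure'), so `lexical_diversity(cat)` counts the letters of the name rather than the words of the genre. The online version of the book also appears to define the score the other way round, as distinct types divided by tokens, which would explain why its values are below 1. The sketch below illustrates both points; `type_token_ratio` and `brown_type_token_ratios` are hypothetical names, and the second function assumes NLTK and the Brown Corpus data are installed.

```python
def lexical_diversity(text):
    """The function from the thread: length divided by number of distinct items."""
    return len(text) / len(set(text))


# Passing a category *name* measures its characters:
# 'adventure' has 9 letters, 8 of them distinct.
name_score = lexical_diversity("adventure")
print(name_score)  # 1.125, the first value in the output posted earlier


def type_token_ratio(tokens):
    """Distinct types divided by total tokens (assumed to match the online table)."""
    tokens = list(tokens)
    return len(set(tokens)) / len(tokens)


def brown_type_token_ratios():
    """Score the words of each Brown category (assumes nltk and the 'brown' data)."""
    # Imported lazily so the checks above run without NLTK installed.
    from nltk.corpus import brown
    return {cat: type_token_ratio(brown.words(categories=cat))
            for cat in brown.categories()}
```

If the corpus is available, printing each value from `brown_type_token_ratios()` rounded to three decimals should make a comparison against the table straightforward, without the division by 10.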