Python Forum

How to calculate the lexical diversity average (with a 1000-word window)
I want to calculate the average lexical diversity over the course of a text. The window length is 1000 words and consecutive windows overlap by 500 words, e.g. [0:999], [500:1499], [1000:1999], etc. First, the function below calculates the slice total for any given text; that result will be used to calculate the average lexical diversity.

def slice_total(text):
    return len(text) / 500 
The objective now is to assess the lexical diversity of the increments specified above (i.e. 1000-word windows). Slicing a text once, twice, or three times by hand is easy with [#:#]. The puzzle I have yet to piece together is how to step through the text in 1000-word increments without writing them all out. Text 1, for instance, is over 200,000 tokens long; writing out all the increments, e.g. [0:999], [500:1499], [1000:1999], [1500:2499], etc., would be an onerous, inefficient task. Below is the code I have so far:

>>> def slice_total(text):
    return len(text) / 500

>>> print(slice_total(text1))
521.638
>>> print(slice_total(text2))
283.152
>>> print(slice_total(text3))
89.528
>>> print(slice_total(text4))
299.594
>>> print(slice_total(text5))
90.02
>>> print(slice_total(text6))
33.934
>>> print(slice_total(text7))
201.352
>>> print(slice_total(text8))
9.734
>>> print(slice_total(text9))
138.426
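As a quick check on what slice_total reports: it divides the token count by the 500-token step, so rounding the result up gives the number of window start positions. A minimal sketch, assuming len(text1) is 260819 (inferred from 521.638 * 500 above):

import math

n_tokens = 260819                    # assumed token count of text1
print(n_tokens / 500)                # 521.638, matches slice_total(text1)
print(math.ceil(n_tokens / 500))     # 522 window start positions at step 500
print(len(range(0, n_tokens, 500)))  # 522 as well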
I attempted to reach the correct result with the code below. It includes two functions (lexical_diversity and lexical_diversity_average) together with attempts to apply the 1000-word window constraints. The last two operations give the same result, which may confirm the 1000-word window with a 500-word overlap between windows. If that is the case, does the operation with '+1000' account for the entire text? My intuition is no, insofar as we expect the result to decrease as more text is included. Does the '+1000' constraint affect the result at all?

>>> def lexical_diversity(text):
	return len(set(text))/len(text)

>>> lexical_diversity(text1[0:999])
0.46146146146146144
>>> lexical_diversity(text1[0:999+1000])
0.40370185092546274
>>> def lexical_diversity_average(text):
	return len(text)/len(set(text))

>>> lexical_diversity_average(text1[0:999])
2.1670281995661607
>>> lexical_diversity_average(text1[0:999][500:1499])
1.9192307692307693
>>> lexical_diversity_average(text1[0:999][500:1499][1000:1999])
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    lexical_diversity_average(text1[0:999][500:1499][1000:1999])
  File "<pyshell#23>", line 2, in lexical_diversity_average
    return len(text)/len(set(text))
ZeroDivisionError: division by zero
>>> lexical_diversity_average(text1[0:999+1000])
2.477075588599752
>>> lexical_diversity_average(text1[0:999][500:1499])
1.9192307692307693
>>> lexical_diversity_average(text1[0:999][500:1499+1000])
1.9192307692307693
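A note on the traceback above: chained slices such as text1[0:999][500:1499] slice the result of the previous slice, not the original text. The first slice has 999 tokens, the second keeps items 500-998 of those (499 tokens), and the third asks for items 1000-1998 of that 499-item list, which is empty, hence the ZeroDivisionError. Likewise, [0:999+1000] is simply [0:1999], so it covers only the first 1999 tokens rather than the entire text, and extending the stop index of a chained slice past the end of the 999-token first slice changes nothing, which is why the last two calls return the same value. A minimal sketch with a plain list of numbers in place of a tokenized text:

data = list(range(3000))      # stand-in for a tokenized text

first = data[0:999]           # 999 items: 0 .. 998
second = first[500:1499]      # 499 items: 500 .. 998 (a slice of the slice)
third = second[1000:1999]     # empty: the 499-item list has no index 1000

print(len(first), len(second), len(third))          # 999 499 0
# len(third) == 0 is what triggers the ZeroDivisionError above
print(first[500:1499] == first[500:1499 + 1000])    # True: a stop index past the end changes nothing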
This might be a first step:
for x in range(0,200001,500):
    print(x, x + 999)
Paul
The package more_itertools could help.

from more_itertools import windowed


window_size = 10
window_step = 5
words = range(100)

for window in windowed(words, n=window_size, step=window_step):
    print(window)
Output:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
(5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
(15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
(20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
(25, 26, 27, 28, 29, 30, 31, 32, 33, 34)
(30, 31, 32, 33, 34, 35, 36, 37, 38, 39)
(35, 36, 37, 38, 39, 40, 41, 42, 43, 44)
(40, 41, 42, 43, 44, 45, 46, 47, 48, 49)
(45, 46, 47, 48, 49, 50, 51, 52, 53, 54)
(50, 51, 52, 53, 54, 55, 56, 57, 58, 59)
(55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
(60, 61, 62, 63, 64, 65, 66, 67, 68, 69)
(65, 66, 67, 68, 69, 70, 71, 72, 73, 74)
(70, 71, 72, 73, 74, 75, 76, 77, 78, 79)
(75, 76, 77, 78, 79, 80, 81, 82, 83, 84)
(80, 81, 82, 83, 84, 85, 86, 87, 88, 89)
(85, 86, 87, 88, 89, 90, 91, 92, 93, 94)
(90, 91, 92, 93, 94, 95, 96, 97, 98, 99)
Instead of printing it to console, you can process the slices.
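For example (a sketch, assuming text1 is a list of tokens as in the earlier posts and reusing the lexical_diversity function from above), the windows can be fed straight into the diversity calculation; the final window may be padded with None, so the padding is stripped first:

from more_itertools import windowed

def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)

scores = []
for window in windowed(text1, n=1000, step=500, fillvalue=None):
    tokens = [t for t in window if t is not None]   # drop padding in the final window
    scores.append(lexical_diversity(tokens))

print(len(scores))
print(sum(scores) / len(scores))   # average lexical diversity over all windows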
Thank you both for your useful responses. DPaul, the code you proposed is effective. I have a related question: is there any way I can write a piece of code using text[m:m+1000] to achieve the same end? This piece of code was suggested in the assignment. From my perspective, the two m variables in such a context represent the portion of text. For instance, [0:999] generates tokens 1-1000. Or maybe it should be [0:499], since text portions are required to be 1000-token increments with a 500-token window (i.e. 1499 tokens). [0:999+1000] results in 1999 tokens. The more central puzzle here is that this code applies only to one portion of the text, which falls short of what is required. Your suggestion

for x in range(0, 200001, 500):
    print(x, x + 999)

was effective. So, how would I write a function, using [m:m+1000], that would generate a list in which each element is a lexical diversity measurement?

Here is the lexical diversity function I created:

def lexical_diversity(text):
    return len(set(text))/len(text)
Now I must apply this operation to the text increments, using [m:m+1000].

Here, once again, is the slice total function I created. When applied to a single text, it gives the number of slices that text will produce (to which the lexical diversity operation will be applied).

def slice_total(text):
    return len(text) / 500
I'm a bit confused by the terminology used: tokens, window words, slices of letters....
But this is what I did: I created a mock text, in this case 3000 numbers as strings, so it is easy to see the overlap.
I then sliced it up into 999-character segments, based on what I proposed in post #2.
Instead of printing each segment, do your calculations on it.
Hope this helps.
Paul

# build a mock text: the numbers 0-2999, each followed by a space
totalText = ''
for x in range(3000):
    totalText += str(x) + ' '

# take a 999-character slice starting every 500 characters, so consecutive slices overlap
for x in range(0, len(totalText), 500):
    slice = totalText[x:x+999]
    print(f'Slice length: {len(slice)}')
    print(slice)
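For instance (a sketch building on the mock text above), the print calls can be replaced with a calculation whose results are collected in a list; here the measure is just the ratio of distinct characters in each slice, as a stand-in for whatever calculation is needed:

results = []
for x in range(0, len(totalText), 500):
    segment = totalText[x:x+999]
    results.append(len(set(segment)) / len(segment))   # per-slice measure

print(len(results))    # number of slices
print(results[:3])     # first few measures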
DPaul, thank you! You have provided me with very helpful info. I have tried your suggestions out and they are effective.

A puzzle still remains, however. The assignment asks me to create *a function*, namely "a function that given a text returns a list that consists of the lexical diversity measures for all of the slices."

So, the fundamental question is how to generate a list of slices to which the lexical diversity function can be applied (see below), thus producing a list of lexical diversity measures for all slices. This is quite frustrating, because I understand the step-by-step process intuitively (i.e. how to generate a list of slices and how to compute lexical diversity), but I don't know how to create a function that applies the one to the other. Thank you again; I am learning a great deal by trial and error.

What I have so far:

(1) The function I created to generate lexical diversity is as follows.

def lexical_diversity(text):
    return len(set(text))/len(text)
(2) The range operation you proposed works well to generate slices. I modified it below to apply to text1. How would I now apply the lexical diversity operation to each slice?

for n in range(0, len(text1), 500):
	slice = text1[n:n+1000]
	print(slice)
The output was as follows:
Output:
Squeezed text(98 lines). Squeezed text(97 lines). Squeezed text(96 lines).
etc.
Making a list of slices should be a "piece of cake" (sorry).
Try this:
mySlices = ['a','b','c']

def addSlice(sl):
    global mySlices
    mySlices.append(sl)
    print(mySlices)

addSlice('d')
Paul
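Putting the pieces of this thread together, a minimal sketch (not tested against the actual NLTK texts) of the kind of function the assignment seems to ask for, i.e. one that, given a text, returns a list with the lexical diversity of every [m:m+1000] slice:

def lexical_diversity(text):
    return len(set(text)) / len(text)

def lexical_diversity_per_slice(text, window=1000, step=500):
    # Return a list with the lexical diversity of each text[m:m+window] slice.
    measures = []
    for m in range(0, len(text), step):
        window_tokens = text[m:m + window]   # the final slice may be shorter than `window`
        measures.append(lexical_diversity(window_tokens))
    return measures

# usage sketch, assuming text1 is a token list (e.g. from nltk.book):
# scores = lexical_diversity_per_slice(text1)
# print(len(scores))
# print(sum(scores) / len(scores))   # the average over all slices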