How to calculate the lexical diversity average (with a 1000-word window length)
#1
I want to calculate the average lexical diversity over the course of a text. The window length is 1000 words, and the overlap between successive text increments is 500 words, e.g. [0:999], [500:1499], [1000:1999], etc. First, the function below calculates the slice total for any given text; that count will later be used to calculate the average lexical diversity.

def slice_total(text):
    # a new slice starts every 500 tokens, so this approximates the slice count
    return len(text) / 500
The objective now is to assess the lexical diversity of the increments specified above (i.e. 1000-word windows). Slicing a text once, twice, or three times is easy with [#:#]. The puzzle I have yet to piece together is how to step through a text in 1000-word windows without writing every slice out by hand. Text 1, for instance, is over 200,000 tokens long; writing out all increments, e.g. [0:999], [500:1499], [1000:1999], [1500:2499], etc., is an onerous, inefficient task. Below is the code I have so far:

>>> def slice_total(text):
    return len(text) / 500

>>> print(slice_total(text1))
521.638
>>> print(slice_total(text2))
283.152
>>> print(slice_total(text3))
89.528
>>> print(slice_total(text4))
299.594
>>> print(slice_total(text5))
90.02
>>> print(slice_total(text6))
33.934
>>> print(slice_total(text7))
201.352
>>> print(slice_total(text8))
9.734
>>> print(slice_total(text9))
138.426
I attempted to reach the correct outcome with the code below. It defines two functions (lexical_diversity and lexical_diversity_average) and applies the 1000-word window constraints. The last two operations produce the same result, which may confirm the 1000-word window with a 500-word overlap between windows. If that is the case, does the operation with '+1000' account for the entire text? My intuition says no, since we would expect the outcome to decrease as more text is included. Does the '+1000' constraint affect the outcome at all?

>>> def lexical_diversity(text):
    return len(set(text))/len(text)

>>> lexical_diversity(text1[0:999])
0.46146146146146144
>>> lexical_diversity(text1[0:999+1000])
0.40370185092546274
>>> def lexical_diversity_average(text):
    return len(text)/len(set(text))

>>> lexical_diversity_average(text1[0:999])
2.1670281995661607
>>> lexical_diversity_average(text1[0:999][500:1499])
1.9192307692307693
>>> lexical_diversity_average(text1[0:999][500:1499][1000:1999])
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    lexical_diversity_average(text1[0:999][500:1499][1000:1999])
  File "<pyshell#23>", line 2, in lexical_diversity_average
    return len(text)/len(set(text))
ZeroDivisionError: division by zero
>>> lexical_diversity_average(text1[0:999+1000])
2.477075588599752
>>> lexical_diversity_average(text1[0:999][500:1499])
1.9192307692307693
>>> lexical_diversity_average(text1[0:999][500:1499+1000])
1.9192307692307693
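
Written out by hand, the increments I mean to evaluate would each be a fresh slice of the original text (text1 being the assignment's token list), e.g.:

lexical_diversity(text1[0:999])
lexical_diversity(text1[500:1499])
lexical_diversity(text1[1000:1999])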
#2
This might be a first step:
for x in range(0, 200001, 500):
    print(x, x + 999)
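
As an aside, the ZeroDivisionError in your transcript comes from chaining slices: each [a:b] re-slices the previous result rather than the original text, so the windows shrink and eventually come up empty. A quick check with a mock list (my made-up stand-in for text1):

# chained slices re-slice the previous result, not the original text
words = list(range(3000))                    # mock stand-in for text1
first = words[0:999]                         # 999 tokens: indices 0-998 of words
second = first[500:1499]                     # 499 tokens: indices 500-998 of first
third = second[1000:1999]                    # empty: second has only 499 tokens
print(len(first), len(second), len(third))   # 999 499 0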
Paul
#3
The package more_itertools could help.

from more_itertools import windowed


window_size = 10
window_step = 5
words = range(100)

# each window holds window_size items and starts window_step items after the previous one
for window in windowed(words, n=window_size, step=window_step):
    print(window)
Output:
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
(5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
(15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
(20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
(25, 26, 27, 28, 29, 30, 31, 32, 33, 34)
(30, 31, 32, 33, 34, 35, 36, 37, 38, 39)
(35, 36, 37, 38, 39, 40, 41, 42, 43, 44)
(40, 41, 42, 43, 44, 45, 46, 47, 48, 49)
(45, 46, 47, 48, 49, 50, 51, 52, 53, 54)
(50, 51, 52, 53, 54, 55, 56, 57, 58, 59)
(55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
(60, 61, 62, 63, 64, 65, 66, 67, 68, 69)
(65, 66, 67, 68, 69, 70, 71, 72, 73, 74)
(70, 71, 72, 73, 74, 75, 76, 77, 78, 79)
(75, 76, 77, 78, 79, 80, 81, 82, 83, 84)
(80, 81, 82, 83, 84, 85, 86, 87, 88, 89)
(85, 86, 87, 88, 89, 90, 91, 92, 93, 94)
(90, 91, 92, 93, 94, 95, 96, 97, 98, 99)
Instead of printing it to console, you can process the slices.
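
Combined with the lexical_diversity function from the first post, a sketch could look like this (a mock word list stands in for the real text; windowed pads the final window with None up to n items, so the padding is filtered out):

from more_itertools import windowed


def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)


words = "the quick brown fox jumps over the lazy dog".split() * 300  # mock text

scores = []
for window in windowed(words, n=1000, step=500):
    tokens = [t for t in window if t is not None]  # drop the None padding
    scores.append(lexical_diversity(tokens))

print(scores)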
#4
Thank you both for your useful responses. DPaul, the code you proposed is effective. I have a related question: is there any way I can write code built around

text[m:m+1000]

to achieve the same end? This expression was suggested in the assignment. As I understand it, the two m variables mark the portion of text to take. For instance, [0:999] generates tokens 1-999. Or maybe it should be

[0:499]

since the text portions are required to be 1000-token increments with a 500-token overlap (i.e. 1499 tokens).

[0:999+1000]

results in 1999 tokens. The more central puzzle is that such a slice covers only one portion of the text, which falls short of what is required. Your suggestion

for x in range(0, 200001, 500):
    print(x, x + 999)

was effective. So how would I write a function using [m:m+1000] that generates a list in which each element is a lexical diversity measurement?

Here is the lexical diversity function I created:

def lexical_diversity(text):
    return len(set(text))/len(text)

Now I must apply this operation to the text increments, using [m:m+1000].

Here, once again, is the slice total function I created. Applied to a single text, it gives the number of slices that text will yield (to each of which the lexical diversity operation will be applied).

def slice_total(text):
    return len(text) / 500
Reply
#5
I'm a bit confused by the terminology used: tokens, window words, slices of letters...
But this is what I did: I created a mock text, in this case 3000 numbers as strings, so the overlap is easy to see.
Then I sliced it into 999-character segments, based on what I proposed in post #2.
Instead of printing each segment, do your calculations on it.
Hope this helps.
Paul

totalText = ''
for x in range(3000):
    totalText += str(x) + ' '

# "segment" rather than "slice" avoids shadowing the built-in name
for x in range(0, len(totalText), 500):
    segment = totalText[x:x + 999]
    print(f'Slice length: {len(segment)}')
    print(segment)
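
If you want the segments collected in a list instead of printed, a list comprehension over the same range does it:

# same mock text as above
totalText = ''
for x in range(3000):
    totalText += str(x) + ' '

# one 999-character segment starting every 500 characters
segments = [totalText[x:x + 999] for x in range(0, len(totalText), 500)]
print(len(segments), 'segments')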
#6
DPaul, thank you! You have provided me with very helpful info. I have tried your suggestions out and they are effective.

A puzzle still remains, however. The assignment asks me to create *a function,* namely "a function that given a text returns a list that consists of the lexical diversity measures for all of the slices."

So, the fundamental question is how to generate a list of slices to which the lexical diversity function can be applied (see below), thus producing a list of lexical diversity measures for all slices. This is quite frustrating because I understand the step-by-step process intuitively (i.e. how to generate a list of slices and how to compute lexical diversity), but I don't know how to create a function that applies the one to the other. Thank you again; I am learning a great deal by trial and error.

What I have so far:

(1) The function I created to generate lexical diversity is as follows.

def lexical_diversity(text):
    return len(set(text))/len(text)
(2) The range operation you proposed works well to generate slices. I modified it below to apply to text1. How would I now apply the lexical diversity operation to each slice?

for n in range(0, len(text1), 500):
    segment = text1[n:n+1000]
    print(segment)
The output was as follows:
Output:
Squeezed text(98 lines). Squeezed text(97 lines). Squeezed text(96 lines).
etc.
#7
Making a list of slices should be a "piece of cake" (sorry).
Try this:
mySlices = ['a','b','c']

def addSlice(sl):
    # append mutates the existing list in place, so no global statement is needed
    mySlices.append(sl)
    print(mySlices)

addSlice('d')
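
Putting the whole thread together, a minimal sketch of the function the assignment seems to ask for, built around the m:m+1000 slice from the assignment hint (the name lexical_diversity_by_window is mine, and text1 is assumed to be the token list used above):

def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)


def lexical_diversity_by_window(text, size=1000, step=500):
    """Return a list with the lexical diversity of every window of the text."""
    scores = []
    for m in range(0, len(text), step):
        window = text[m:m + size]   # trailing windows may be shorter than size
        scores.append(lexical_diversity(window))
    return scores


# e.g. scores = lexical_diversity_by_window(text1)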
Paul

