Jul-21-2020, 05:16 PM
I want to calculate the average lexical diversity over the course of a text, using a sliding window of 1000 words with an overlap of 500 words between consecutive windows, e.g. [0:999], [500:1499], [1000:1999], etc. First, below, I define a function that calculates the slice total for any given text; that result will be used to calculate the average lexical diversity.
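In other words, the window start positions advance by 500 words at a time. Just to make the scheme concrete, the starts could in principle be generated like this (purely an illustration of the indices I have in mind, not part of my code; note that a slice [i:i+1000] spans a full 1000 words, whereas [0:999] spans only 999):

>>> window, step = 1000, 500
>>> starts = range(0, len(text1) - window + 1, step)
>>> [(i, i + window) for i in starts[:3]]   # first three windows
[(0, 1000), (500, 1500), (1000, 2000)]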
The objective now is to assess the lexical diversity of each of these 1000-word windows. Slicing a text once, twice, or three times is easy enough with [#:#]. The puzzle I have yet to piece together is how to step through a text in 1000-word windows without writing every slice out by hand. text1, for instance, is over 200,000 tokens long; writing out every increment, e.g. [0:999], [500:1499], [1000:1999], [1500:2499], etc., would be onerous and inefficient. Below is the code I have so far:
In the following code I try to reach the correct operation/outcome. It defines two functions (lexical_diversity and lexical_diversity_average) and applies the 1000-word window constraints. The last two operations give the same result, which may confirm the 1000-word window with a 500-word overlap shared between windows. If that is the case, does the operation with '+1000' account for the entire text? My intuition is no, since I would expect the outcome to decrease as more text is included. Does the '+1000' constraint affect the result at all? (A rough sketch of the kind of loop I think I am after follows the code below.)
def slice_total(text):
    return len(text) / 500
>>> def slice_total(text):
        return len(text) / 500

>>> print(slice_total(text1))
521.638
>>> print(slice_total(text2))
283.152
>>> print(slice_total(text3))
89.528
>>> print(slice_total(text4))
299.594
>>> print(slice_total(text5))
90.02
>>> print(slice_total(text6))
33.934
>>> print(slice_total(text7))
201.352
>>> print(slice_total(text8))
9.734
>>> print(slice_total(text9))
138.426
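As an aside on the numbers above (my own note, not part of the original session): slice_total returns a fraction, so it would presumably need rounding before it can serve as an actual window count. Assuming complete 1000-word windows taken every 500 words, the count for text1 would be something like:

>>> window, step = 1000, 500
>>> (len(text1) - window) // step + 1   # complete windows in text1
520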
>>> def lexical_diversity(text):
        return len(set(text)) / len(text)

>>> lexical_diversity(text1[0:999])
0.46146146146146144
>>> lexical_diversity(text1[0:999+1000])
0.40370185092546274
>>> def lexical_diversity_average(text):
        return len(text) / len(set(text))

>>> lexical_diversity_average(text1[0:999])
2.1670281995661607
>>> lexical_diversity_average(text1[0:999][500:1499])
1.9192307692307693
>>> lexical_diversity_average(text1[0:999][500:1499][1000:1999])
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    lexical_diversity_average(text1[0:999][500:1499][1000:1999])
  File "<pyshell#23>", line 2, in lexical_diversity_average
    return len(text) / len(set(text))
ZeroDivisionError: division by zero
>>> lexical_diversity_average(text1[0:999+1000])
2.477075588599752
>>> lexical_diversity_average(text1[0:999][500:1499])
1.9192307692307693
>>> lexical_diversity_average(text1[0:999][500:1499+1000])
1.9192307692307693
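For clarity, what I am ultimately after is roughly this: loop over the window start positions in steps of 500, score each 1000-word window with lexical_diversity, and average the scores. The following is only a sketch of what I mean (the function name lexical_diversity_windowed_average and its parameters are mine, not from NLTK), so I may well be missing something:

def lexical_diversity(text):
    return len(set(text)) / len(text)

def lexical_diversity_windowed_average(text, window=1000, step=500):
    # Score every window of `window` words, starting a new window every
    # `step` words (consecutive windows overlap by window - step words),
    # then average the per-window scores.
    scores = [lexical_diversity(text[i:i + window])
              for i in range(0, len(text) - window + 1, step)]
    return sum(scores) / len(scores)

If that is the right idea, it might also explain the ZeroDivisionError above: text1[0:999][500:1499] slices the already-sliced 999-word window a second time, leaving only 499 words, and applying [1000:1999] to that yields an empty list, so len(set(text)) is 0. Chained slices seem to be relative to the previous result rather than to text1 itself, which would also be why adding '+1000' to the inner slice's stop index changes nothing, and why text1[0:999+1000] is just the first 1999 words rather than the entire text.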