Jul-21-2020, 05:16 PM
I want to calculate the average lexical diversity over the course of a text, using a sliding window of 1000 words with a 500-word overlap between consecutive windows, e.g. [0:999], [500:1499], [1000:1999], etc. First, the function below calculates the total number of slices for any given text; that count will later be used to average the lexical diversity across windows.
def slice_total(text):
    return len(text) / 500

The objective now is to assess the lexical diversity of each of the increments specified above (i.e. 1000-word windows). Slicing a text once, twice, or three times by hand is easily done with [#:#]. The puzzle I have yet to piece together is how to step through a whole text in these 1000-word increments without writing every slice out. text1, for instance, is over 200,000 tokens long; writing out all the increments, e.g. [0:999], [500:1499], [1000:1999], [1500:2499], etc., is an onerous, inefficient task.
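My best guess is that a loop over window start offsets would remove the need to write the slices out by hand. Below is a rough, unverified sketch of that idea (the helper name windows is just a placeholder of mine; note that Python slices are half-open, so text[0:1000] holds exactly 1000 words, while text[0:999] holds 999):

def windows(text, size=1000, step=500):
    # Yield successive overlapping slices:
    # text[0:1000], text[500:1500], text[1000:2000], ...
    for start in range(0, len(text) - size + 1, step):
        yield text[start:start + size]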
Here is the code I have so far:

>>> def slice_total(text): return len(text) / 500
>>> print(slice_total(text1))
[output]521.638[/output]
>>> print(slice_total(text2))
[output]283.152[/output]
>>> print(slice_total(text3))
[output]89.528[/output]
>>> print(slice_total(text4))
[output]299.594[/output]
>>> print(slice_total(text5))
[output]90.02[/output]
>>> print(slice_total(text6))
[output]33.934[/output]
>>> print(slice_total(text7))
[output]201.352[/output]
>>> print(slice_total(text8))
[output]9.734[/output]
>>> print(slice_total(text9))
[output]138.426[/output]

I then attempt to reach the correct outcome with the code below. It defines two functions, lexical_diversity and lexical_diversity_average, and applies the 1000-word window constraints to text1. The last two operations produce the same result, which may confirm the 1000-word window with a 500-word overlap shared between windows. If that is the case, does the operation with '+1000' account for the entire text? My intuition is no, insofar as we expect the outcome to decrease as more text is included. Does the '+1000' constraint impact the outcome at all?
>>> def lexical_diversity(text): return len(set(text))/len(text)
>>> lexical_diversity(text1[0:999])
0.46146146146146144
>>> lexical_diversity(text1[0:999+1000])
0.40370185092546274
>>> def lexical_diversity_average(text): return len(text)/len(set(text))
>>> lexical_diversity_average(text1[0:999])
[output]2.1670281995661607[/output]
>>> lexical_diversity_average(text1[0:999][500:1499])
[output]1.9192307692307693[/output]
>>> lexical_diversity_average(text1[0:999][500:1499][1000:1999])
[error]Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    lexical_diversity_average(text1[0:999][500:1499][1000:1999])
  File "<pyshell#23>", line 2, in lexical_diversity_average
    return len(text)/len(set(text))
ZeroDivisionError: division by zero[/error]
>>> lexical_diversity_average(text1[0:999+1000])
[output]2.477075588599752[/output]
>>> lexical_diversity_average(text1[0:999][500:1499])
[output]1.9192307692307693[/output]
>>> lexical_diversity_average(text1[0:999][500:1499+1000])
[output]1.9192307692307693[/output]
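For what it is worth, I suspect the ZeroDivisionError arises because chained slices operate on the previous slice rather than on text1: text1[0:999] has 999 tokens, [500:1499] of that leaves 499, and [1000:1999] of those 499 is empty, so the division fails on an empty set. That would also explain why [500:1499] and [500:1499+1000] give the same result: the first slice has only 999 tokens, so both stop at its end. If that reading is right, my best attempt at the whole computation looks like the sketch below (names and structure are my own, unverified; it averages the per-window score rather than scoring one long slice):

def lexical_diversity(text):
    return len(set(text)) / len(text)

def mean_lexical_diversity(text, size=1000, step=500):
    # Score each 1000-word window starting every 500 words,
    # then average the per-window scores.
    scores = [lexical_diversity(text[start:start + size])
              for start in range(0, len(text) - size + 1, step)]
    return sum(scores) / len(scores)

Is this the right way to approach it, and does the '+1000' in my attempts above change anything?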