Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: General Coding Help (https://python-forum.io/forum-8.html)
Thread: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists (/thread-23263.html)
Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-18-2019

Hi all,

tl;dr: I have a list of sentences and a word frequency list. Python is to compare the individual words in each sentence with the word frequency list and select an appropriate word for cloze deletion. Example Python code is available, but I don't know what to do with it.

I'm a native German speaker, I'm studying European Portuguese, and I'm trying to bulk generate cloze deletions for import into the flashcard application Anki. The starting point is the freely available Tatoeba sentences; cloze selection is to be based on a word frequency list of the European Portuguese language. I read the article "Bulk Generating Cloze Deletions for Learning a Language with Anki", which includes example Python code. I was trying to reproduce what the author was doing, unfortunately to no avail. I've also reached out to the author of the article, who does not have time to support this.

I've downloaded the Tatoeba sentences and Hermit Dave's word frequency list, and extracted sentence pairs with a script available on GitHub. However, starting with the paragraph "Choosing the cloze word", the article goes over my head.

To summarize, what I have right now:

What I'm trying to automate (described in "Choosing the cloze word" and "Generating a CSV for Anki" in the article):
As I mentioned above, the article includes example Python code, but I don't know how to apply it. I would really appreciate any help. Thanks a lot.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-19-2019

Assuming you generated the files as described in the article, you need to combine all code snippets into a single .py file (i.e. just copy/paste them one after another), add a line at the very bottom that calls the generate function with the necessary arguments, and run that .py script. The generate function uses the three functions before it. Also, the voice-generating function uses the boto3 package and assumes boto3 is already installed; see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#installation
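As a rough illustration of the "Choosing the cloze word" step the thread is working toward (this is not the article's code; the Portuguese words and frequency values below are invented), the selection boils down to: skip capitalized and very short words, then keep the word whose value in the frequency list is smallest, treating unknown words as a large default:

```python
# Hypothetical sketch of the cloze-word selection idea; the sample words
# and frequency values are invented for illustration.

def pick_cloze_word(sentence, frequency):
    default = 20001            # value used for words not in the list
    best_word, best_value = None, default
    for word in sentence.split():
        if word.isupper() or word.istitle():
            continue           # skip likely proper nouns
        if len(word) <= 2:
            continue           # skip tiny words
        value = frequency.get(word.lower(), default)
        if value < best_value:
            best_word, best_value = word, value
    return best_word

freq = {'gato': 1500, 'bebe': 900, 'leite': 3200}   # invented values
print(pick_cloze_word('O gato bebe leite', freq))   # -> bebe
```

The function returns None when no word qualifies, which is why the article's pipeline skips such sentences entirely.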
RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-19-2019

Thanks buran. This sounds challenging for someone with my Python skill level. I'm open to learning, though, and I think I understand what you are saying, so I will try the above and report back.

With respect to the file generation: I did not do it exactly as described in the article, because the grep command did not extract the expected sentences. Hence, I extracted the sentences with a different script, which results in a slightly different generated file. However, it isn't too complicated to fix this manually if need be. I'll skip the voice-generating piece with Amazon Polly, since Anki is able to do this after import.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-19-2019

I don't want to look into the details of "how different" the two formats are and what changes may be required. If you are not going to generate audio, you don't need the middle snippet, nor do you need to install boto3. Also remove

    # Generate audio
    audio_filename = 'fra-{}-audio.mp3'.format(fra_number)
    if not os.path.isfile(audio_filename):
        synthesize_speech(fra_sentence, audio_filename)

from the generate function. As to skill level - it's as easy as running a simple Hello World program.
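Putting the advice so far together: one .py file, the article's remaining functions pasted in order with the audio lines removed, and a single call at the very bottom. A toy skeleton of that shape (placeholder function bodies, not the article's code):

```python
# Toy skeleton showing the shape of the combined script, not its real logic:
# helper functions first, then one call at the bottom runs everything.

def find_cloze(sentence):
    return max(sentence.split(), key=len)      # placeholder logic

def make_index(rows):
    return dict(rows)                          # placeholder logic

def generate(rows):
    index = make_index(rows)                   # generate() calls the helpers above
    return {number: find_cloze(s) for number, s in index.items()}

# The one line added at the bottom that actually runs the script:
print(generate([('1', 'bom dia'), ('2', 'muito obrigado')]))
```

Nothing runs until that last line, which is why the pasted snippets alone appear to "do nothing".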
RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-20-2019

Quote: As to skill level - it's as easy as running a simple Hello World program

I'm sure it is. But in my case, I'd need support and possibly a good amount of trial and error for a Hello World program, too. My Python knowledge is very, very limited - my apologies. Thank you for the pointers, though. I've copied together the script parts and removed all parts that are unnecessary now that it doesn't need to generate audio. I'm in the process of getting the base files into the same shape as outlined in the article, and after that I will be ready to give this a go.

Quote: add line to call generate function, supplying the necessary arguments and run that py script. generate function uses the 3 functions before that.

Can you please elaborate on how to do this?

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-20-2019

(Dec-20-2019, 12:14 PM)wizzie Wrote: Can you please elaborate on how to do this?

Add

    generate(portuguese_sentence_file, german_sentence_file, links_file, frequency_list_file)

Of course, replace these with the actual path and name of each of the files. If they are in the current working directory (the folder from which you run the script, alongside your script), just the names will do.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-22-2019

Thank you buran, I'm almost there, it seems. I've put this together and fixed the occurring syntax errors and warnings online - mostly concerned with line length, whitespace and indentation. The script fails using Python 2.7.17 but works well-ish when using Python 3.6.9. It shows "Making indexes ..." for a while, as expected, but fails at "Generating clozes ..."
as follows. See below FYI for the whole script.

    import csv


    def find_cloze(sentence, frequency_list):
        translator = str.maketrans(string.punctuation,
                                   ' ' * len(string.punctuation))
        sentence = sentence.translate(translator)
        max_frequency = 20001  # Frequency list has 50,000 entries
        min_frequency = max_frequency
        min_word = None
        valid_words = []
        for word in sentence.split():
            if word.isupper() or word.istitle():
                continue  # Skip proper nouns
            if len(word) <= 2:
                continue  # Skip tiny words
            valid_words.append(word)
            word_frequency = int(frequency_list.get(word.lower(), max_frequency))
            if word_frequency < min_frequency:
                min_word = word
                min_frequency = word_frequency
        if min_word:
            return min_word
        else:
            if valid_words:
                return random.choice(valid_words)
            else:
                return None


    def make_index(path, delimiter, value=1):
        d = dict()
        with open(path, newline='') as f:
            reader = csv.reader(f, delimiter=delimiter)
            for row in reader:
                d[row[0]] = row[value]
        return d


    def generate(french_file, english_file, links_file, frequency_file):
        print("Making indexes ...")
        french = make_index(french_file, '\t', value=2)
        english = make_index(english_file, '\t', value=2)
        links = make_index(links_file, '\t')
        # Make index between word and usage frequency
        frequency = make_index(frequency_file, ' ')
        print("Generating clozes ...")
        with open("out.csv", 'w', newline='') as outfile:
            writer = csv.writer(
                outfile, delimiter='\t', quotechar='|',
                quoting=csv.QUOTE_MINIMAL)
            # For each French sentence
            for fra_number, fra_sentence in french.items():
                # Lookup English translation
                eng_number = links.get(fra_number)
                if not eng_number:
                    continue  # If no English translation, skip
                eng_sentence = english.get(eng_number)
                if not eng_sentence:
                    continue  # If no English translation, skip
                # Find the cloze word
                fra_cloze_word = find_cloze(fra_sentence, frequency)
                if not fra_cloze_word:
                    continue  # If no cloze word, skip
                clozed = fra_sentence.replace(
                    fra_cloze_word,
                    '{{{{c1::{}}}}}'.format(fra_cloze_word)
                )
                writer.writerow(
                    fra_number, clozed,
                    eng_number, eng_sentence)
        print("Done.")


    generate('portuguese.csv', 'deutsch.csv', 'links.csv', 'pt_20k.txt')

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-22-2019

You need import string; put it after import csv. string is another module from the Python standard library. The code is for Python 3, so it's normal/likely to fail with Python 2. Sorry, I didn't pay attention that he didn't provide the import statements for csv and string. Funny, he did for boto3.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-23-2019

It threw an error, which I fixed by adding "[" and "]" around the writerow arguments. Then it threw another, which I fixed by adding "import random", and now it works. Thank you so much for your help, it is much appreciated! You saved me many hours of manual work. The final code is as follows.

    import csv
    import string
    import random


    def find_cloze(sentence, frequency_list):
        translator = str.maketrans(string.punctuation,
                                   ' ' * len(string.punctuation))
        sentence = sentence.translate(translator)
        max_frequency = 20001  # 50k frequency list cut down to 20,000 entries
        min_frequency = max_frequency
        min_word = None
        valid_words = []
        for word in sentence.split():
            if word.isupper() or word.istitle():
                continue  # Skip proper nouns
            if len(word) <= 2:
                continue  # Skip tiny words
            valid_words.append(word)
            word_frequency = int(frequency_list.get(word.lower(), max_frequency))
            if word_frequency < min_frequency:
                min_word = word
                min_frequency = word_frequency
        if min_word:
            return min_word
        else:
            if valid_words:
                return random.choice(valid_words)
            else:
                return None


    def make_index(path, delimiter, value=1):
        d = dict()
        with open(path, newline='') as f:
            reader = csv.reader(f, delimiter=delimiter)
            for row in reader:
                d[row[0]] = row[value]
        return d


    def generate(target_file, native_file, links_file, frequency_file):
        print("Making indexes ...")
        target = make_index(target_file, '\t', value=2)
        native = make_index(native_file, '\t', value=2)
        links = make_index(links_file, '\t')
        # Make index between word and usage frequency
        frequency = make_index(frequency_file, ' ')
        print("Generating clozes ...")
        with open("out.csv", 'w', newline='') as outfile:
            writer = csv.writer(
                outfile, delimiter='\t', quotechar='|',
                quoting=csv.QUOTE_MINIMAL)
            # For each target sentence
            for target_number, target_sentence in target.items():
                # Lookup native translation
                native_number = links.get(target_number)
                if not native_number:
                    continue  # If no native translation, skip
                native_sentence = native.get(native_number)
                if not native_sentence:
                    continue  # If no native translation, skip
                # Find the cloze word
                target_cloze_word = find_cloze(target_sentence, frequency)
                if not target_cloze_word:
                    continue  # If no cloze word, skip
                clozed = target_sentence.replace(
                    target_cloze_word,
                    '{{{{c1::{}}}}}'.format(target_cloze_word)
                )
                writer.writerow(
                    [target_number, clozed,
                     native_number, native_sentence])
        print("Done.")


    generate('target.csv', 'native.csv', 'links.csv', 'frequency.txt')

Edit: Changed the code to be "language agnostic".

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-23-2019

Great. You see, it's not that hard :-). And the square brackets were there - you removed them by mistake when deleting the sound part.
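Two details from the final fixes are easy to verify in isolation: csv.writer's writerow takes a single iterable (hence the added square brackets), and the quadrupled braces in the format string collapse into the double braces of Anki's cloze syntax. A small standalone check (the sentence numbers and text are invented):

```python
import csv
import io

# 1. writerow() takes one iterable argument, so the four fields must be
#    wrapped in a list; passing them as separate arguments raises TypeError.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(['4242', 'O gato {{c1::bebe}} leite',
                 '9999', 'Die Katze trinkt Milch'])

# 2. In str.format(), '{{' and '}}' are escapes for one literal brace, so
#    the template's four braces produce the two braces Anki expects.
clozed = '{{{{c1::{}}}}}'.format('bebe')
print(clozed)   # -> {{c1::bebe}}
```

This is why the script's template needs four braces on each side even though the card in Anki shows only two.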