Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists

wizzie · Dec-18-2019, 09:40 PM

Hi all

tl;dr: I have a list of sentences and a word frequency list. A Python script should compare the individual words in each sentence with the word frequency list and select an appropriate word for cloze deletion. Example Python code is available, but I don't know what to do with it.

I'm a native German speaker studying European Portuguese, and I'm trying to bulk generate cloze deletions for import into the flashcard application Anki. The starting point is the freely available Tatoeba sentences; cloze selection is to be based on a word frequency list for European Portuguese.

I read through the article "Bulk Generating Cloze Deletions for Learning a Language with Anki", which includes example Python code. I tried to reproduce what the author was doing, unfortunately to no avail. I've also reached out to the author of the article, who does not have time to provide support.

I've downloaded the Tatoeba sentences and Hermit Dave's word frequency list, and extracted sentence pairs with a script available on GitHub. However, starting with the paragraph "Choosing the cloze word", the article goes over my head.

To summarize - what I have right now:

File 1: A csv file with > 16'000 rows and two columns: Sentence in Portuguese; respective Sentence in German. Can be changed into a file with solely the Portuguese sentences if necessary.
File 2: A word frequency list (txt) for European Portuguese with 50k words and two columns: Word in Portuguese; Number of occurrences in corpus. List will be shortened to 20k or so entries.
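
For reference, a minimal sketch of how File 2 could be loaded and shortened to its most common entries. It assumes one "word count" pair per line, separated by whitespace, with the list already sorted from most to least common. The function name `load_frequency` and the sample counts are made up for illustration and do not come from the article.

```python
import io


def load_frequency(lines, limit=20000):
    """Map each word to its corpus count, keeping only the first `limit` entries.

    Assumes one "word count" pair per line and that the list is already
    sorted from most to least common.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        counts[parts[0]] = int(parts[1])
        if len(counts) >= limit:
            break
    return counts


# Tiny in-memory stand-in for the real file (counts are made up)
sample = io.StringIO("de 1000\nque 900\nestruturas 12\n")
print(load_frequency(sample, limit=2))  # {'de': 1000, 'que': 900}
```

With the real list, an open file object can be passed instead of the `StringIO` stand-in, since the function only iterates over lines.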

What I'm trying to automate (described in "Choosing the cloze word" and "Generating a CSV for Anki" in the article):

Go through sentences in Portuguese in file 1
Check words in sentences against the frequency list (file 2) and select the least common word (= minimum frequency, furthest down in the frequency list, i.e. the most "difficult" of the most common) as cloze
Replace the selected cloze word in the sentence with the Anki code for a cloze, i.e. adding "{{c1::" in front of the selected word and "}}" after it.
Before: Observe as estruturas do inglês nativo e tire suas conclusões.
After: Observe as {{c1::estruturas}} do inglês nativo e tire suas conclusões.
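
The three steps above can be sketched roughly as follows. This is not the article's code: it skips the article's handling of proper nouns and short words, the names `pick_cloze_word` and `make_cloze` are hypothetical, and the counts are invented so that the example sentence comes out as shown.

```python
import re
import string


def pick_cloze_word(sentence, counts):
    """Return the word with the lowest corpus count, or None if none is known."""
    # Replace punctuation with spaces so "conclusões." becomes "conclusões"
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    best_word, best_count = None, None
    for word in sentence.translate(table).split():
        count = counts.get(word.lower())
        if count is None:
            continue  # word is not in the frequency list
        if best_count is None or count < best_count:
            best_word, best_count = word, count
    return best_word


def make_cloze(sentence, word):
    """Wrap the first whole-word occurrence of `word` in Anki cloze markup."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, lambda m: "{{c1::" + m.group(0) + "}}", sentence, count=1)


# Made-up counts, for illustration only
counts = {"observe": 40, "as": 900, "estruturas": 12, "do": 950,
          "inglês": 60, "nativo": 25, "e": 990, "tire": 30,
          "suas": 300, "conclusões": 20}
sentence = "Observe as estruturas do inglês nativo e tire suas conclusões."
print(make_cloze(sentence, pick_cloze_word(sentence, counts)))
# Observe as {{c1::estruturas}} do inglês nativo e tire suas conclusões.
```

The word-boundary pattern is a design choice of this sketch: it keeps a short cloze word from also being wrapped where it appears inside a longer word.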

As I mentioned above, the article includes example python code but I don't know how to apply this. I would really appreciate any help. Thanks a lot.

**buran** · (This post was last modified: Dec-19-2019, 10:56 AM by buran.)

Assuming you generated the files as described in the article, you need to combine all the code snippets into a single .py file (i.e. just copy/paste them one after another), add a line at the very bottom that calls the generate function with the necessary arguments, and run that script. The generate function uses the 3 functions defined before it.
Also, the voice-generating function uses the boto3 package and assumes boto3 is already installed; see https://boto3.amazonaws.com/v1/documenta...stallation

wizzie · Dec-19-2019, 12:47 PM

Thanks buran. This sounds challenging for someone with my Python skill level. I'm open to learning though and think I understand what you are saying, so I will try the above and report back.
With respect to the file generation: I did not do it exactly as described in the article because the grep command did not extract the expected sentences. Instead, I extracted the sentences with a different script, which results in a slightly different generated file. However, it isn't too complicated to fix this manually if need be. I'll skip the voice-generating piece with Amazon Polly since Anki is able to do this after import.

**buran** · Dec-19-2019, 12:52 PM

I don't want to look into details "how different" the two formats are and what changes may be required.

if you are not going to generate audio - you don't need the middle snippet, nor to install boto3
also remove

            # Generate audio
            audio_filename = 'fra-{}-audio.mp3'.format(fra_number)
            if not os.path.isfile(audio_filename):
                synthesize_speech(fra_sentence, audio_filename)

from generate function.

As to skill level - it's as easy as running a simple Hello world program

wizzie · Dec-20-2019, 12:14 PM

Quote: As to skill level - it's as easy as running a simple Hello world program

I'm sure it is. But in my case, I'd need support and possibly a good amount of trial and error for a Hello world program, too. My Python knowledge is very very limited - my apologies.

Thank you for the pointers though. I've copied the script parts together and removed everything that is only needed for generating audio. I'm in the process of getting the base files into the same shape as outlined in the article, and after that I will be ready to give this a go.

Quote: add line to call generate function, supplying the necessary arguments and run that py script. generate function uses the 3 functions before that.

Can you please elaborate on how to do this?

**buran** · Dec-20-2019, 12:28 PM

(Dec-20-2019, 12:14 PM)wizzie Wrote: Can you please elaborate on how to do this?

add

generate(portuguese_sentence_file, german_sentence_file, links_file, frequency_list_file)

Of course, replace these with the actual path and name of each file. If the files are in the current working directory (the folder you run the script from), just the file names will do.
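
Since relative file names are resolved against the current working directory, it can help to check that the inputs are actually found before calling generate. A small sketch, assuming the four files sit next to the script; `missing_files` is a hypothetical helper and the file names are placeholders to be replaced with your own.

```python
import os


def missing_files(paths):
    """Return the paths that do not exist (relative paths use the current directory)."""
    return [p for p in paths if not os.path.isfile(p)]


# Placeholder file names; replace them with your own
inputs = ["portuguese.csv", "deutsch.csv", "links.csv", "pt_20k.txt"]

missing = missing_files(inputs)
if missing:
    print("Not found in", os.getcwd(), "->", ", ".join(missing))
else:
    print("All input files found; generate(*inputs) can be called.")
```

If a file is reported as missing, either move it into the printed folder or pass its full path instead of the bare name.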

wizzie · (This post was last modified: Dec-22-2019, 03:31 PM by wizzie.)

Thank you buran, I'm almost there it seems. I've put this together and fixed occurring syntax errors and warnings online - mostly concerned with line length, white spaces and indents. The script fails using Python 2.7.17 but works well-ish when using Python 3.6.9. It shows "Making indexes..." for a while as expected but fails when "Generating clozes..." as follows

Error:
Making indexes ...
Generating clozes ...
Traceback (most recent call last):
  File "script.py", line 94, in <module>
    'pt_20k.txt')
  File "script.py", line 74, in generate
    fra_cloze_word = find_cloze(fra_sentence, frequency)
  File "script.py", line 5, in find_cloze
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
NameError: name 'string' is not defined

See below FYI the whole script.

import csv


def find_cloze(sentence, frequency_list):
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    sentence = sentence.translate(translator)

    max_frequency = 20001  # Frequency list has 50,000 entries
    min_frequency = max_frequency

    min_word = None
    valid_words = []
    for word in sentence.split():
        if word.isupper() or word.istitle():
            continue  # Skip proper nouns
        if len(word) <= 2:
            continue  # Skip tiny words

        valid_words.append(word)

        word_frequency = int(frequency_list.get(word.lower(), max_frequency))
        if word_frequency < min_frequency:
            min_word = word
            min_frequency = word_frequency

    if min_word:
        return min_word
    else:
        if valid_words:
            return random.choice(valid_words)
        else:
            return None


def make_index(path, delimiter, value=1):
    d = dict()
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        for row in reader:
            d[row[0]] = row[value]
    return d


def generate(french_file, english_file, links_file, frequency_file):
    print("Making indexes ...")

    french = make_index(french_file, '\t', value=2)
    english = make_index(english_file, '\t', value=2)
    links = make_index(links_file, '\t')

    # Make index between word and usage frequency
    frequency = make_index(frequency_file, ' ')

    print("Generating clozes ...")
    with open("out.csv", 'w', newline='') as outfile:
        writer = csv.writer(
            outfile,
            delimiter='\t',
            quotechar='|',
            quoting=csv.QUOTE_MINIMAL)

        # For each French sentence
        for fra_number, fra_sentence in french.items():
            # Lookup English translation
            eng_number = links.get(fra_number)
            if not eng_number:
                continue  # If no English translation, skip

            eng_sentence = english.get(eng_number)
            if not eng_sentence:
                continue  # If no English translation, skip

            # Find the cloze word
            fra_cloze_word = find_cloze(fra_sentence, frequency)
            if not fra_cloze_word:
                continue  # If no cloze word, skip

            clozed = fra_sentence.replace(
                fra_cloze_word,
                '{{{{c1::{}}}}}'.format(fra_cloze_word)
                )

            writer.writerow(
                fra_number,
                clozed,
                eng_number,
                eng_sentence)

    print("Done.")

generate('portuguese.csv',
         'deutsch.csv',
         'links.csv',
         'pt_20k.txt')

**buran** · (This post was last modified: Dec-22-2019, 03:56 PM by buran.)

You need import string. Put it after import csv. string is another module from the Python standard library.
The code is for Python 3, so it's normal/likely for it to fail with Python 2.

Sorry, I didn't notice that he didn't provide the import statements for csv and string. Funny, he did for boto3.
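
To illustrate what the failing line does once the string module is imported, here is a short stand-alone sketch using the example sentence from the first post. Note that string.punctuation only covers ASCII punctuation, so characters such as « and » are left untouched.

```python
import string

# Build a table that maps every ASCII punctuation character to a space
translator = str.maketrans(string.punctuation, " " * len(string.punctuation))

sentence = "Observe as estruturas do inglês nativo e tire suas conclusões."
words = sentence.translate(translator).split()
print(words)
# ['Observe', 'as', 'estruturas', 'do', 'inglês', 'nativo', 'e', 'tire', 'suas', 'conclusões']
```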

wizzie · (This post was last modified: Dec-23-2019, 01:00 PM by wizzie.)

It threw the following error

Error:
writerow() takes exactly one argument (4 given)

which I fixed by adding "[" and "]" around the writerow arguments. Then it threw

Error:
NameError: name 'random' is not defined

which I fixed by adding "import random" and it now works.

Thank you so much for your help, it is much appreciated! You saved me many hours of manual work.
The final code is as follows.

import csv
import random
import string


def find_cloze(sentence, frequency_list):
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    sentence = sentence.translate(translator)

    max_frequency = 20001  # 50k frequency list cut down to 20,000 entries
    min_frequency = max_frequency

    min_word = None
    valid_words = []
    for word in sentence.split():
        if word.isupper() or word.istitle():
            continue  # Skip proper nouns
        if len(word) <= 2:
            continue  # Skip tiny words

        valid_words.append(word)

        word_frequency = int(frequency_list.get(word.lower(), max_frequency))
        if word_frequency < min_frequency:
            min_word = word
            min_frequency = word_frequency

    if min_word:
        return min_word
    else:
        if valid_words:
            return random.choice(valid_words)
        else:
            return None


def make_index(path, delimiter, value=1):
    d = dict()
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        for row in reader:
            d[row[0]] = row[value]
    return d


def generate(target_file, native_file, links_file, frequency_file):
    print("Making indexes ...")

    target = make_index(target_file, '\t', value=2)
    native = make_index(native_file, '\t', value=2)
    links = make_index(links_file, '\t')

    # Make index between word and usage frequency
    frequency = make_index(frequency_file, ' ')

    print("Generating clozes ...")
    with open("out.csv", 'w', newline='') as outfile:
        writer = csv.writer(
            outfile,
            delimiter='\t',
            quotechar='|',
            quoting=csv.QUOTE_MINIMAL)

        # For each target sentence
        for target_number, target_sentence in target.items():
            # Lookup native translation
            native_number = links.get(target_number)
            if not native_number:
                continue  # If no native translation, skip

            native_sentence = native.get(native_number)
            if not native_sentence:
                continue  # If no native translation, skip

            # Find the cloze word
            target_cloze_word = find_cloze(target_sentence, frequency)
            if not target_cloze_word:
                continue  # If no cloze word, skip

            clozed = target_sentence.replace(
                target_cloze_word,
                '{{{{c1::{}}}}}'.format(target_cloze_word))

            writer.writerow([
                target_number,
                clozed,
                native_number,
                native_sentence])

    print("Done.")

generate('target.csv',
         'native.csv',
         'links.csv',
         'frequency.txt')

Edit: Changed code to be "language agnostic".

**buran** · (This post was last modified: Dec-23-2019, 12:08 PM by buran.)

Great. You see, it's not that hard :-). And the square brackets were there; you removed them by mistake when deleting the sound part.
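
For anyone hitting the same writerow() error: the method takes a single iterable of fields, not one argument per field. A minimal sketch with the same writer settings as the script; the row values are invented, and an in-memory buffer stands in for out.csv.

```python
import csv
import io

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer, delimiter="\t", quotechar="|",
                    quoting=csv.QUOTE_MINIMAL)

# writer.writerow("1", "a", "2", "b") would raise TypeError:
# writerow() accepts exactly one argument, an iterable of fields.
writer.writerow(["1", "Observe as {{c1::estruturas}}", "2", "Beispielsatz"])

print(repr(buffer.getvalue()))
# '1\tObserve as {{c1::estruturas}}\t2\tBeispielsatz\r\n'
```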
