Python Forum
Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists
#1
Hi all

tl;dr: I have a list of sentences and a word frequency list. I want Python to compare the individual words in each sentence against the word frequency list and select an appropriate word for cloze deletion. Example Python code is available, but I don't know how to use it.

I'm a native German speaker studying European Portuguese, and I'm trying to bulk generate cloze deletions for import into the flashcard application Anki. The starting point is the freely available Tatoeba sentences; cloze selection is to be based on a word frequency list for European Portuguese.

I read through the article "Bulk Generating Cloze Deletions for Learning a Language with Anki", which includes example Python code. I tried to reproduce what the author was doing, unfortunately to no avail. I've also reached out to the author of the article, who does not have time to provide support.

I've downloaded the Tatoeba sentences and Hermit Dave's word frequency list, and extracted sentence pairs with a script available on GitHub. However, starting with the paragraph "Choosing the cloze word", the article goes over my head.

To summarize - what I have right now:
  1. File 1: A csv file with > 16'000 rows and two columns: Sentence in Portuguese; respective Sentence in German. Can be changed into a file with solely the Portuguese sentences if necessary.
  2. File 2: A word frequency list (txt) for European Portuguese with 50k words and two columns: Word in Portuguese; Number of occurrences in corpus. List will be shortened to 20k or so entries.
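
For illustration, a list shaped like File 2 could be read into a dictionary and cut off after a fixed number of entries roughly as follows. This is only a sketch: the helper name load_counts is made up for this example, the sample words and counts are invented, and it assumes one "word count" pair per line separated by whitespace.

```python
import io

def load_counts(lines, limit):
    """Map each word to its corpus count, keeping at most `limit` entries."""
    counts = {}
    for line in lines:
        if len(counts) >= limit:
            break  # Enough entries collected, ignore the rest of the list
        parts = line.split()
        if len(parts) != 2:
            continue  # Skip blank or malformed lines
        word, count = parts
        counts[word] = int(count)
    return counts

# Invented sample data standing in for the real frequency file.
sample = io.StringIO("de 100\n\nque 90\no 80\n")
print(load_counts(sample, 2))
```

With a real file, the io.StringIO object would be replaced by an open file handle, e.g. open('pt_50k.txt', encoding='utf-8'), where the file name is again just a placeholder.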

What I'm trying to automate (described in "Choosing the cloze word" and "Generating a CSV for Anki" in the article):
  1. Go through sentences in Portuguese in file 1
  2. Check words in sentences against the frequency list (file 2) and select the least common word (= minimum frequency, furthest down in the frequency list, i.e. the most "difficult" of the most common) as cloze
  3. Replace the selected cloze word in the sentence with the Anki code for a cloze, i.e. adding "{{c1::" in front of the selected word and "}}" after it.
    Before: Observe as estruturas do inglês nativo e tire suas conclusões.
    After: Observe as {{c1::estruturas}} do inglês nativo e tire suas conclusões.
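
The three steps above can be sketched in a few lines. This is a minimal illustration and not the article's code: the function name make_cloze and the small counts dictionary are assumptions invented for the example, and it simply picks the listed word with the lowest corpus count.

```python
import re
import string

def make_cloze(sentence, counts):
    """Wrap the rarest listed word of `sentence` in Anki cloze markup."""
    # Replace punctuation with spaces so "conclusões." is seen as "conclusões".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = sentence.translate(table).split()
    # Only consider words that appear in the frequency list.
    known = [w for w in words if w.lower() in counts]
    if not known:
        return None
    # The lowest corpus count marks the least common word.
    rarest = min(known, key=lambda w: counts[w.lower()])
    # Wrap the first whole-word occurrence only.
    pattern = r"\b" + re.escape(rarest) + r"\b"
    return re.sub(pattern, lambda m: "{{c1::" + m.group(0) + "}}", sentence, count=1)

# Hypothetical counts, invented for this example.
counts = {"observe": 900, "as": 50000, "estruturas": 120, "do": 60000, "tire": 300}
result = make_cloze("Observe as estruturas do inglês nativo e tire suas conclusões.", counts)
print(result)
```

Words missing from the list are ignored here; whether they should instead count as the "hardest" candidates is a design choice the sketch does not settle.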

As I mentioned above, the article includes example Python code, but I don't know how to apply it. I would really appreciate any help. Thanks a lot.
Reply
#2
Assuming you generated the files as described in the article, you need to combine all the code snippets into a single .py file (i.e. just copy/paste them one after another). At the very bottom, add a line that calls the generate function with the necessary arguments, then run that script. The generate function uses the three functions defined before it.
Also, the voice-generating function uses the boto3 package and assumes boto3 is already installed; see https://boto3.amazonaws.com/v1/documenta...stallation
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
Thanks, buran. This sounds challenging for someone with my Python skill level. I'm open to learning, though, and I think I understand what you are saying, so I will try the above and report back.
With respect to the file generation: I did not do it exactly as described in the article, because the grep command did not extract the expected sentences. I therefore extracted the sentences with a different script, which results in a slightly different generated file. However, it isn't too complicated to fix this manually if need be. I'll skip the voice-generating piece with Amazon Polly, since Anki is able to do this after import.
Reply
#4
I don't want to look into the details of how different the two formats are and what changes may be required.

If you are not going to generate audio, you don't need the middle snippet, nor do you need to install boto3.
Also remove
            # Generate audio
            audio_filename = 'fra-{}-audio.mp3'.format(fra_number)
            if not os.path.isfile(audio_filename):
                synthesize_speech(fra_sentence, audio_filename)
from the generate function.

As to skill level: it's as easy as running a simple Hello world program.

Reply
#5
Quote: As to skill level - its as easy as running simple Hello world program
I'm sure it is. But in my case, I'd need support and possibly a good amount of trial and error for a Hello world program, too. My Python knowledge is very very limited - my apologies.

Thank you for the pointers though. I've copied together the script parts and have removed all parts which are unnecessary due to it not needing to generate audio. I'm in the process of getting the base files into the same shape as outlined in the article and after that will be ready to give this a go.

Quote:add line to call generate function, supplying the necessary arguments and run that py script. generate function uses the 3 functions before that.
Can you please elaborate on how to do this?
Reply
#6
(Dec-20-2019, 12:14 PM)wizzie Wrote: Can you please elaborate on how to do this?
add
generate(portuguese_sentence_file, german_sentence_file, links_file, frequency_list_file)
Of course, replace each argument with the actual path and name of the corresponding file. If the files are in the current working directory (the folder from which you run the script), just the file names will do.

Reply
#7
Thank you buran, I'm almost there it seems. I've put this together and fixed occurring syntax errors and warnings online - mostly concerned with line length, white spaces and indents. The script fails using Python 2.7.17 but works well-ish when using Python 3.6.9. It shows "Making indexes..." for a while as expected but fails when "Generating clozes..." as follows
Error:
Making indexes ...
Generating clozes ...
Traceback (most recent call last):
  File "script.py", line 94, in <module>
    'pt_20k.txt')
  File "script.py", line 74, in generate
    fra_cloze_word = find_cloze(fra_sentence, frequency)
  File "script.py", line 5, in find_cloze
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
NameError: name 'string' is not defined
For reference, the whole script is below.
import csv


def find_cloze(sentence, frequency_list):
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    sentence = sentence.translate(translator)

    max_frequency = 20001  # Frequency list has 50,000 entries
    min_frequency = max_frequency

    min_word = None
    valid_words = []
    for word in sentence.split():
        if word.isupper() or word.istitle():
            continue  # Skip proper nouns
        if len(word) <= 2:
            continue  # Skip tiny words

        valid_words.append(word)

        word_frequency = int(frequency_list.get(word.lower(), max_frequency))
        if word_frequency < min_frequency:
            min_word = word
            min_frequency = word_frequency

    if min_word:
        return min_word
    else:
        if valid_words:
            return random.choice(valid_words)
        else:
            return None


def make_index(path, delimiter, value=1):
    d = dict()
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        for row in reader:
            d[row[0]] = row[value]
    return d


def generate(french_file, english_file, links_file, frequency_file):
    print("Making indexes ...")

    french = make_index(french_file, '\t', value=2)
    english = make_index(english_file, '\t', value=2)
    links = make_index(links_file, '\t')

    # Make index between word and usage frequency
    frequency = make_index(frequency_file, ' ')

    print("Generating clozes ...")
    with open("out.csv", 'w', newline='') as outfile:
        writer = csv.writer(
            outfile,
            delimiter='\t',
            quotechar='|',
            quoting=csv.QUOTE_MINIMAL)

        # For each French sentence
        for fra_number, fra_sentence in french.items():
            # Lookup English translation
            eng_number = links.get(fra_number)
            if not eng_number:
                continue  # If no English translation, skip

            eng_sentence = english.get(eng_number)
            if not eng_sentence:
                continue  # If no English translation, skip

            # Find the cloze word
            fra_cloze_word = find_cloze(fra_sentence, frequency)
            if not fra_cloze_word:
                continue  # If no cloze word, skip

            clozed = fra_sentence.replace(
                fra_cloze_word,
                '{{{{c1::{}}}}}'.format(fra_cloze_word)
                )

            writer.writerow(
                fra_number,
                clozed,
                eng_number,
                eng_sentence)

    print("Done.")

generate('portuguese.csv',
         'deutsch.csv',
         'links.csv',
         'pt_20k.txt')
Reply
#8
You need to add import string; put it after import csv. string is another module from the Python standard library.
The code is written for Python 3, so it's normal for it to fail with Python 2.

Sorry, I didn't notice that he didn't provide the import statements for csv and string. Funny, he did for boto3.
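
To show what the failing line does once string is imported, here is a small standalone sketch. The sample sentence is made up; the maketrans call is the same one used in find_cloze.

```python
import string

# Map every ASCII punctuation character to a space, as find_cloze does.
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

cleaned = "Olá, mundo! Tudo bem?".translate(translator)
words = cleaned.split()
print(words)  # ['Olá', 'mundo', 'Tudo', 'bem']
```

Without the import, the name string is simply unknown to the interpreter, which is exactly what the NameError reports.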

Reply
#9
It threw the following error
Error:
writerow() takes exactly one argument (4 given)
which I fixed by adding "[" and "]" around the writerow arguments. Then it threw
Error:
NameError: name 'random' is not defined
which I fixed by adding "import random" and it now works.
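
For anyone hitting the same writerow error: csv.writer.writerow expects a single iterable holding the whole row, not one argument per column. A small sketch that writes to an in-memory buffer instead of out.csv (the row values are placeholders):

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter='\t',
    quotechar='|',
    quoting=csv.QUOTE_MINIMAL)

try:
    # Four separate arguments: rejected with a TypeError
    writer.writerow(1, 'Observe as {{c1::estruturas}}.', 2, 'Beispiel')
except TypeError as exc:
    print('TypeError:', exc)

# One list containing the four fields: accepted
writer.writerow([1, 'Observe as {{c1::estruturas}}.', 2, 'Beispiel'])
print(repr(buffer.getvalue()))
```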

Thank you so much for your help, it is much appreciated! You saved me many hours of manual work.
The final code is as follows.

import csv
import random
import string


def find_cloze(sentence, frequency_list):
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    sentence = sentence.translate(translator)

    max_frequency = 20001  # 50k frequency list cut down to 20,000 entries
    min_frequency = max_frequency

    min_word = None
    valid_words = []
    for word in sentence.split():
        if word.isupper() or word.istitle():
            continue  # Skip proper nouns
        if len(word) <= 2:
            continue  # Skip tiny words

        valid_words.append(word)

        word_frequency = int(frequency_list.get(word.lower(), max_frequency))
        if word_frequency < min_frequency:
            min_word = word
            min_frequency = word_frequency

    if min_word:
        return min_word
    else:
        if valid_words:
            return random.choice(valid_words)
        else:
            return None


def make_index(path, delimiter, value=1):
    d = dict()
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        for row in reader:
            d[row[0]] = row[value]
    return d


def generate(target_file, native_file, links_file, frequency_file):
    print("Making indexes ...")

    target = make_index(target_file, '\t', value=2)
    native = make_index(native_file, '\t', value=2)
    links = make_index(links_file, '\t')

    # Make index between word and usage frequency
    frequency = make_index(frequency_file, ' ')

    print("Generating clozes ...")
    with open("out.csv", 'w', newline='') as outfile:
        writer = csv.writer(
            outfile,
            delimiter='\t',
            quotechar='|',
            quoting=csv.QUOTE_MINIMAL)

        # For each target sentence
        for target_number, target_sentence in target.items():
            # Lookup native translation
            native_number = links.get(target_number)
            if not native_number:
                continue  # If no native translation, skip

            native_sentence = native.get(native_number)
            if not native_sentence:
                continue  # If no native translation, skip

            # Find the cloze word
            target_cloze_word = find_cloze(target_sentence, frequency)
            if not target_cloze_word:
                continue  # If no cloze word, skip

            clozed = target_sentence.replace(
                target_cloze_word,
                '{{{{c1::{}}}}}'.format(target_cloze_word)
                )

            writer.writerow(
                [target_number,
                 clozed,
                 native_number,
                 native_sentence])

    print("Done.")

generate('target.csv',
         'native.csv',
         'links.csv',
         'frequency.txt')
Edit: Changed code to be "language agnostic".
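
One thing to be aware of with the script above: str.replace substitutes every occurrence of the cloze word, including occurrences inside longer words. If that is not wanted, a regular expression with word boundaries is one possible alternative. The helper name wrap_first_whole_word and the sample sentence are made up for this sketch.

```python
import re

def wrap_first_whole_word(sentence, word):
    """Wrap only the first standalone occurrence of `word` in cloze markup."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, lambda m: "{{c1::" + m.group(0) + "}}", sentence, count=1)

sentence = "A casa e o casamento"

# Behaviour of the script: "casa" inside "casamento" is wrapped as well.
with_replace = sentence.replace("casa", "{{c1::casa}}")
print(with_replace)  # A {{c1::casa}} e o {{c1::casa}}mento

# Word-boundary variant: only the standalone word is wrapped.
with_regex = wrap_first_whole_word(sentence, "casa")
print(with_regex)  # A {{c1::casa}} e o casamento
```

Whether every whole-word occurrence or only the first should become a cloze is a matter of taste; dropping count=1 wraps all of them.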
Reply
#10
Great. You see, it's not that hard :-). And the square brackets were there; you removed them by mistake when deleting the sound part.

Reply

