Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: General Coding Help (https://python-forum.io/forum-8.html)
Thread: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists (/thread-23263.html)
Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-18-2019

Hi all,

tl;dr: I have a list of sentences and a word frequency list. Python is to compare the individual words in each sentence with the word frequency list and select an appropriate word for cloze deletion. Example Python code is available, but I don't know what to do with it.

I'm a native German speaker, I'm studying European Portuguese, and I'm trying to bulk generate cloze deletions for import into the flashcard application Anki. The starting point is the freely available Tatoeba sentences; cloze selection is to be based on a word frequency list of the European Portuguese language. I read the article "Bulk Generating Cloze Deletions for Learning a Language with Anki", which includes example Python code. I was trying to reproduce what the author was doing, unfortunately to no avail. I've also reached out to the author of the article, who does not have time to support this.

I've downloaded the Tatoeba sentences and Hermit Dave's word frequency list, and extracted sentence pairs with a script available on GitHub. However, starting with the paragraph "Choosing the cloze word", the article goes over my head.

To summarize, what I have right now:

What I'm trying to automate (described in "Choosing the cloze word" and "Generating a CSV for Anki" in the article):
As I mentioned above, the article includes example Python code, but I don't know how to apply it. I would really appreciate any help. Thanks a lot.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-19-2019

Assuming you generated the files as described in the article, you need to combine all code snippets into a single .py file (i.e. just copy/paste them one after another), add a line at the very bottom that calls the generate function with the necessary arguments, and run that .py script. The generate function uses the three functions before it. Also, the voice-generating function uses the boto3 package and assumes boto3 is already installed; see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#installation
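As a rough illustration of the "Choosing the cloze word" step the thread is working toward (this is not the article's code; the Portuguese words and frequency values below are invented), the selection boils down to: skip capitalized and very short words, then keep the word whose value in the frequency list is smallest, treating unknown words as a large default:

```python
# Hypothetical sketch of the cloze-word selection idea; the sample words
# and frequency values are invented for illustration.

def pick_cloze_word(sentence, frequency):
    default = 20001            # value used for words not in the list
    best_word, best_value = None, default
    for word in sentence.split():
        if word.isupper() or word.istitle():
            continue           # skip likely proper nouns
        if len(word) <= 2:
            continue           # skip tiny words
        value = frequency.get(word.lower(), default)
        if value < best_value:
            best_word, best_value = word, value
    return best_word

freq = {'gato': 1500, 'bebe': 900, 'leite': 3200}   # invented values
print(pick_cloze_word('O gato bebe leite', freq))   # -> bebe
```

The function returns None when no word qualifies, which is why the article's pipeline skips such sentences entirely.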
RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-19-2019

Thanks buran. This sounds challenging for someone with my Python skill level. I'm open to learning, though, and I think I understand what you are saying, so I will try the above and report back.

With respect to the file generation: I did not do it exactly as described in the article, because the grep command did not extract the expected sentences. Hence, I extracted the sentences with a different script, which results in a slightly different generated file. However, it isn't too complicated to fix this manually if need be. I'll skip the voice-generating piece with Amazon Polly, since Anki is able to do this after import.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-19-2019

I don't want to look into the details of "how different" the two formats are and what changes may be required. If you are not going to generate audio, you don't need the middle snippet, nor do you need to install boto3. Also remove

    # Generate audio
    audio_filename = 'fra-{}-audio.mp3'.format(fra_number)
    if not os.path.isfile(audio_filename):
        synthesize_speech(fra_sentence, audio_filename)

from the generate function. As to skill level - it's as easy as running a simple Hello World program.
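Putting the advice so far together: one .py file, the article's remaining functions pasted in order with the audio lines removed, and a single call at the very bottom. A toy skeleton of that shape (placeholder function bodies, not the article's code):

```python
# Toy skeleton showing the shape of the combined script, not its real logic:
# helper functions first, then one call at the bottom runs everything.

def find_cloze(sentence):
    return max(sentence.split(), key=len)      # placeholder logic

def make_index(rows):
    return dict(rows)                          # placeholder logic

def generate(rows):
    index = make_index(rows)                   # generate() calls the helpers above
    return {number: find_cloze(s) for number, s in index.items()}

# The one line added at the bottom that actually runs the script:
print(generate([('1', 'bom dia'), ('2', 'muito obrigado')]))
```

Nothing runs until that last line, which is why the pasted snippets alone appear to "do nothing".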
RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-20-2019

Quote: As to skill level - it's as easy as running a simple Hello World program

I'm sure it is. But in my case, I'd need support and possibly a good amount of trial and error for a Hello World program, too. My Python knowledge is very, very limited - my apologies. Thank you for the pointers, though. I've copied together the script parts and removed all parts that are unnecessary now that it doesn't need to generate audio. I'm in the process of getting the base files into the same shape as outlined in the article, and after that I will be ready to give this a go.

Quote: add line to call generate function, supplying the necessary arguments and run that py script. generate function uses the 3 functions before that.

Can you please elaborate on how to do this?

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-20-2019

(Dec-20-2019, 12:14 PM)wizzie Wrote: Can you please elaborate on how to do this?

Add

    generate(portuguese_sentence_file, german_sentence_file, links_file, frequency_list_file)

Of course, replace these with the actual path and name of each of the files. If they are in the current working directory (the folder from which you run the script, alongside your script), just the names will do.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-22-2019

Thank you buran, I'm almost there, it seems. I've put this together and fixed the occurring syntax errors and warnings online - mostly concerned with line length, whitespace and indentation. The script fails using Python 2.7.17 but works well-ish when using Python 3.6.9. It shows "Making indexes ..." for a while, as expected, but fails at "Generating clozes ..."
as follows. See below FYI for the whole script.

    import csv


    def find_cloze(sentence, frequency_list):
        translator = str.maketrans(string.punctuation,
                                   ' ' * len(string.punctuation))
        sentence = sentence.translate(translator)
        max_frequency = 20001  # Frequency list has 50,000 entries
        min_frequency = max_frequency
        min_word = None
        valid_words = []
        for word in sentence.split():
            if word.isupper() or word.istitle():
                continue  # Skip proper nouns
            if len(word) <= 2:
                continue  # Skip tiny words
            valid_words.append(word)
            word_frequency = int(frequency_list.get(word.lower(), max_frequency))
            if word_frequency < min_frequency:
                min_word = word
                min_frequency = word_frequency
        if min_word:
            return min_word
        else:
            if valid_words:
                return random.choice(valid_words)
            else:
                return None


    def make_index(path, delimiter, value=1):
        d = dict()
        with open(path, newline='') as f:
            reader = csv.reader(f, delimiter=delimiter)
            for row in reader:
                d[row[0]] = row[value]
        return d


    def generate(french_file, english_file, links_file, frequency_file):
        print("Making indexes ...")
        french = make_index(french_file, '\t', value=2)
        english = make_index(english_file, '\t', value=2)
        links = make_index(links_file, '\t')
        # Make index between word and usage frequency
        frequency = make_index(frequency_file, ' ')
        print("Generating clozes ...")
        with open("out.csv", 'w', newline='') as outfile:
            writer = csv.writer(
                outfile, delimiter='\t', quotechar='|',
                quoting=csv.QUOTE_MINIMAL)
            # For each French sentence
            for fra_number, fra_sentence in french.items():
                # Lookup English translation
                eng_number = links.get(fra_number)
                if not eng_number:
                    continue  # If no English translation, skip
                eng_sentence = english.get(eng_number)
                if not eng_sentence:
                    continue  # If no English translation, skip
                # Find the cloze word
                fra_cloze_word = find_cloze(fra_sentence, frequency)
                if not fra_cloze_word:
                    continue  # If no cloze word, skip
                clozed = fra_sentence.replace(
                    fra_cloze_word,
                    '{{{{c1::{}}}}}'.format(fra_cloze_word)
                )
                writer.writerow(
                    fra_number, clozed,
                    eng_number, eng_sentence)
        print("Done.")


    generate('portuguese.csv', 'deutsch.csv', 'links.csv', 'pt_20k.txt')

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-22-2019

You need import string; put it after import csv. string is another module from the Python standard library. The code is for Python 3, so it's normal/likely to fail with Python 2. Sorry, I didn't pay attention that he didn't provide the import statements for csv and string. Funny, he did for boto3.

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - wizzie - Dec-23-2019

It threw an error, which I fixed by adding "[" and "]" around the writerow arguments. Then it threw another, which I fixed by adding "import random", and now it works. Thank you so much for your help, it is much appreciated! You saved me many hours of manual work. The final code is as follows.

    import csv
    import string
    import random


    def find_cloze(sentence, frequency_list):
        translator = str.maketrans(string.punctuation,
                                   ' ' * len(string.punctuation))
        sentence = sentence.translate(translator)
        max_frequency = 20001  # 50k frequency list cut down to 20,000 entries
        min_frequency = max_frequency
        min_word = None
        valid_words = []
        for word in sentence.split():
            if word.isupper() or word.istitle():
                continue  # Skip proper nouns
            if len(word) <= 2:
                continue  # Skip tiny words
            valid_words.append(word)
            word_frequency = int(frequency_list.get(word.lower(), max_frequency))
            if word_frequency < min_frequency:
                min_word = word
                min_frequency = word_frequency
        if min_word:
            return min_word
        else:
            if valid_words:
                return random.choice(valid_words)
            else:
                return None


    def make_index(path, delimiter, value=1):
        d = dict()
        with open(path, newline='') as f:
            reader = csv.reader(f, delimiter=delimiter)
            for row in reader:
                d[row[0]] = row[value]
        return d


    def generate(target_file, native_file, links_file, frequency_file):
        print("Making indexes ...")
        target = make_index(target_file, '\t', value=2)
        native = make_index(native_file, '\t', value=2)
        links = make_index(links_file, '\t')
        # Make index between word and usage frequency
        frequency = make_index(frequency_file, ' ')
        print("Generating clozes ...")
        with open("out.csv", 'w', newline='') as outfile:
            writer = csv.writer(
                outfile, delimiter='\t', quotechar='|',
                quoting=csv.QUOTE_MINIMAL)
            # For each target sentence
            for target_number, target_sentence in target.items():
                # Lookup native translation
                native_number = links.get(target_number)
                if not native_number:
                    continue  # If no native translation, skip
                native_sentence = native.get(native_number)
                if not native_sentence:
                    continue  # If no native translation, skip
                # Find the cloze word
                target_cloze_word = find_cloze(target_sentence, frequency)
                if not target_cloze_word:
                    continue  # If no cloze word, skip
                clozed = target_sentence.replace(
                    target_cloze_word,
                    '{{{{c1::{}}}}}'.format(target_cloze_word)
                )
                writer.writerow(
                    [target_number, clozed,
                     native_number, native_sentence])
        print("Done.")


    generate('target.csv', 'native.csv', 'links.csv', 'frequency.txt')

Edit: Changed the code to be "language agnostic".

RE: Bulk Generating Cloze Deletions based on Tatoeba sentences and word frequency lists - buran - Dec-23-2019

Great. You see, it's not that hard :-). And the square brackets were there - you removed them by mistake when deleting the sound part.
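Two details from the final fixes are easy to verify in isolation: csv.writer's writerow takes a single iterable (hence the added square brackets), and the quadrupled braces in the format string collapse into the double braces of Anki's cloze syntax. A small standalone check (the sentence numbers and text are invented):

```python
import csv
import io

# 1. writerow() takes one iterable argument, so the four fields must be
#    wrapped in a list; passing them as separate arguments raises TypeError.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(['4242', 'O gato {{c1::bebe}} leite',
                 '9999', 'Die Katze trinkt Milch'])

# 2. In str.format(), '{{' and '}}' are escapes for one literal brace, so
#    the template's four braces produce the two braces Anki expects.
clozed = '{{{{c1::{}}}}}'.format('bebe')
print(clozed)   # -> {{c1::bebe}}
```

This is why the script's template needs four braces on each side even though the card in Anki shows only two.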