How to compute conditional unigram probabilities?

jbond · Jan-25-2020, 11:34 AM

Hello,

i have difficulties with my homework (Task 4).
I don't know how to do this.
I have already an attempt but I think it is wrong and I don't know how to go on.
The task gives me pseudocode as a hint but I can't make code from it.
I know it's not that hard and ist only a few lines, but I have no Idea what to do.

my code for task 1, 2, 3 and my attempt for task 4:

 
import re

class Ngram:
    filename = ""
    n = 0
    raw_counts = {}
    prob = {}
    cond_prob = {}
    
    # Task 1
    def __init__(self, filename="", n=0):
        self.filename = filename
        self.n = n

    # Task 2
    def extract_raw_counts(self):
        fp = open(self.filename, 'r') 
        allLines = fp.readlines()
        for line in allLines: 
            tokenLst = tokenize_smart(line.rstrip("\r\n"))
            for i in range(0,self.n-1):
                tokenLst.insert(0,"BOS")
                tokenLst.append("EOS")
            for i in range(len(tokenLst)-self.n):
                newTuple = tuple(tokenLst[i:i+self.n])
                if newTuple in self.raw_counts:
                    self.raw_counts[newTuple] += 1
                else:
                    self.raw_counts[newTuple] = 1
        
    # Task 3
    def extract_probabilities(self):
        sumRawCounts = sum(self.raw_counts.values()) + len(self.raw_counts)
        for key in self.raw_counts:
            self.prob[key] = self.raw_counts[key] / sumRawCounts

    # Task 4
    def extract_conditional_probabilities(self):
        #my attemt for task 4
        for key in self.prob:
            mgram = key[0:self.n-1]
            unigram = key[self.n]
            if not mgram in self.prob:
                self.prob[mgram] = {}
            else:
                self.cond_prob[mgram] = unigram
            
        
        pass

    # Task 5
    def generate_random_token(self, mgram):
        """
        Generate a random next token based on an n-1 gram,
        taking into account the probability distribution over the possible next tokens for that n-1-gram.

        :param mgram: the n-1 gram to generate the next token for.
        :type mgram: a tuple (of length n-1) of strings.
        :return a random next token for the n-1-gram.
        :rtype str
        """
        pass

    # Task 6
    def generate_random_sentence(self):
        """
        Generate a random sentence.

        :return a random sentence
        :rtype list[str]
        """
        pass


def tokenize_smart(sentence):
    """
    Tokenize the sentence into tokens (words, punctuation).

    :param sentence: the sentence to be tokenized
    :type sentence: str
    :return: list of tokens in the sentence
    :rtype: list[str]
    """
    tokens = []
    for word in re.sub(r" +", " ", sentence).split():
        word = re.sub(r"[\"„”“»«`\(\)]", "", word)
        if word != "":
            if word[-1] in ".,!?;:":
                if len(word) == 1:
                    tokens += [word]
                else:
                    tokens += [word[:-1], word[-1]]
            else:
                tokens.append(word)

    return tokens


def list2str(sentence):
    """
    Convert a sentence given as a list of strings to the sentence as a string separated by whitespace.
    
    :param sentence: the string list to be joined
    :type sentence: list[str]
    :return: sentence as a string, separated by whitespace
    :rtype: str
    """
    sentence = " ".join(sentence)
    sentence = re.sub(r" ([\.,!\?;:])", r"\1", sentence)
    return sentence


if __name__ == '__main__':
    
    # Task 1
    print("Task 1:")
    ngram_model = Ngram("de-sentences-tatoeba.txt", 2)
    print(ngram_model.n, ngram_model.filename)
    print(ngram_model.raw_counts, ngram_model.prob, ngram_model.cond_prob)
    
    # Task 2
    print("\nTask 2:")
    ngram_model.extract_raw_counts()
    print(ngram_model.raw_counts[("kaltes", "Land")])
    print(ngram_model.raw_counts[("schönes", "Land")])
    
    # Task 3
    print("\nTask 3:")
    ngram_model.extract_probabilities()
    print(ngram_model.prob[("kaltes", "Land")])
    print(ngram_model.prob[("schönes", "Land")])
    
    '''
    # Task 4
    ngram_model.extract_conditional_probabilities()
    print(ngram_model.cond_prob[(" beobachteten ",)])
    print(ngram_model.cond_prob[("schönes",)][("Land")])
    # Task 5
    print(ngram_model.generate_random_token(("den",)))
    print(ngram_model.generate_random_token(("den",)))
    print(ngram_model.generate_random_token(("den",)))
    # Task 6
    print(list2str(ngram_model.generate_random_sentence()))
    print(list2str(ngram_model.generate_random_sentence()))
    '''

Task 1 and 2:
[Image: c8kWdT3]

Task 3 and 4: (Here I get stucked)
[Image: nnd3pD4]

Can someone help me or give me a hint?

Thank you in advance

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Maths and python: Compute stress level	cheerful	1	2,812	Oct-20-2021, 10:05 AM Last Post: Larz60+
	Compute complex solutions in quadratic equations	liam	1	1,979	Feb-09-2020, 04:18 PM Last Post: Gribouillis
	To extract a specific column from csv file and compute the average	vicson	2	8,219	Oct-20-2018, 03:18 AM Last Post: vicson
	Write a program to compute the sum of the terms of the series: 4 - 8 + 12 - 16 + 20 -	chewey777	0	2,898	Mar-24-2018, 12:39 AM Last Post: chewey777
	How do you compute tf-idf from a list without using the counter class	syntaxkiller	8	5,439	Dec-01-2017, 05:24 PM Last Post: nilamo
	compute gross pay	jamesuzo	1	10,415	Sep-07-2017, 01:47 PM Last Post: ichabod801

How to compute conditional unigram probabilities?

User Panel Messages

Announcements