Hi,
Not sure if this is the right section for this, but I have a problem when normalizing text. If I print the all_text variable right after normalizing, the text is normalized. Unfortunately, when I then build bigrams from the text, the Unicode characters reappear.
all_text = unicodedata.normalize("NFC", all_text)
all_text = sent_tokenize(all_text)
bigrams = [b for sentence in all_text
           for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]
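One possible explanation, offered only as a guess since the post does not show the actual output: printing a list of tuples shows the repr() of each string, and repr() escapes some characters (for example a no-break space appears as \xa0), whereas printing the string itself shows the real characters. The sketch below assumes Python 3, uses made-up sample text, and replaces sent_tokenize with a simple split on "." so it runs without NLTK.

```python
import unicodedata

# Hypothetical sample text (not from the original post): accents written as
# combining marks, plus a no-break space (U+00A0) between "au" and "lait".
all_text = "cafe\u0301 au\u00a0lait. Tre\u0300s bien."

all_text = unicodedata.normalize("NFC", all_text)

# Stand-in for sent_tokenize (assumption): naive split on "." for illustration.
sentences = [s.strip() for s in all_text.split(".") if s.strip()]

bigrams = [b for sentence in sentences
           for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]

# Printing the string shows the characters themselves.
print(all_text)

# Printing the list shows repr() of each string, which escapes U+00A0.
print(bigrams)

# Printing each bigram as a string avoids the escaped form.
for first, second in bigrams:
    print(first, second)

# NFKC (unlike NFC) also maps the no-break space to an ordinary space.
nfkc_text = unicodedata.normalize("NFKC", all_text)
```

If this is the cause, the data itself is still normalized and only the display differs. Whether NFKC is appropriate depends on whether compatibility characters should be folded in your text.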