Python Forum
Normalizing scraped text - Printable Version

+--- Thread: Normalizing scraped text (/thread-23573.html)



Normalizing scraped text - wuggs - Jan-06-2020

Hi,
I'm not sure this is the right section for this, but I have a problem with normalizing text. If I print the all_text variable after normalizing, the text is normalized. Unfortunately, when I try to create bigrams from the text, the Unicode characters reappear.
    
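    import unicodedata
    from nltk.tokenize import sent_tokenize

    all_text = unicodedata.normalize("NFC", all_text)  # canonical composition (NFC)
    all_text = sent_tokenize(all_text)  # all_text is now a list of sentence strings
    bigrams = [b for sentence in all_text for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]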



RE: Normalizing scraped text - Larz60+ - Jan-06-2020

First you set all_text to unicodedata.normalize("NFC", all_text),
then immediately afterwards you overwrite it with whatever sent_tokenize(all_text) returns.
Is that what you intended?


RE: Normalizing scraped text - wuggs - Jan-06-2020

Hi, thanks for answering! My intention was to remove all the Unicode characters by normalizing the text, then split the text into sentences and store them in a list, which is why I used the sent_tokenize method. I don't know how to do that while preserving the normalized text.
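
A minimal sketch of that intent, assuming Python 3. Note that NFC only composes characters and leaves non-ASCII characters in place; one common way to actually drop them is NFKD decomposition followed by an ASCII encode that ignores errors, which may or may not be what is wanted here. The helpers to_ascii, split_sentences and sentence_bigrams are hypothetical names, and split_sentences is only a rough regex stand-in for nltk's sent_tokenize so that the snippet runs without NLTK.

```python
import re
import unicodedata


def to_ascii(text):
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def split_sentences(text):
    """Hypothetical stand-in for nltk's sent_tokenize: split after . ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_bigrams(sentences):
    """Pair each word with its successor, one sentence at a time."""
    bigrams = []
    for sentence in sentences:
        words = sentence.split()
        bigrams.extend(zip(words, words[1:]))
    return bigrams


# Sample input made up for illustration.
raw_text = "Caf\u00e9 cr\u00e8me is nice. Na\u00efve caf\u00e9 talk."
normalized_text = to_ascii(raw_text)          # kept in its own variable
sentences = split_sentences(normalized_text)  # list of sentence strings
bigrams = sentence_bigrams(sentences)
print(bigrams)
```

Keeping the normalized text and the sentence list in separate variables means neither is overwritten, and the bigrams never cross a sentence boundary.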


RE: Normalizing scraped text - Larz60+ - Jan-07-2020

Are you still using Python 2.7?
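
One possible reason the version matters, offered as an assumption rather than a diagnosis: in Python 2, printing a list or tuple shows the repr of each string, so non-ASCII characters appear as escape sequences even though the strings themselves are unchanged. The sketch below imitates that in Python 3 with the built-in ascii(); the sample words are made up for illustration.

```python
# A bigram holding non-ASCII characters (made-up sample data).
bigram = ("caf\u00e9", "cr\u00e8me")

# Printing a single string shows the characters themselves.
print(bigram[0])

# ascii() escapes non-ASCII characters, much like printing a
# container of unicode strings does in Python 2.
escaped = ascii(bigram)
print(escaped)
```

In both cases the stored strings are identical; only their displayed form differs.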