Normalizing scraped text
#1
Hi,
Not sure if this is the right section for this, but I have a problem when normalizing text. If I print the all_text variable right after normalizing, the text is normalized. Unfortunately, when I try to build bigrams from the text, the Unicode characters reappear.
    import unicodedata
    from nltk.tokenize import sent_tokenize  # assumed: sent_tokenize is NLTK's

    all_text = unicodedata.normalize("NFC", all_text)
    all_text = sent_tokenize(all_text)
    bigrams = [b for sentence in all_text for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]
Larz60+ wrote Jan-07-2020, 03:32 AM:
Please post all code, output, and errors (in their entirety) between their respective tags. I did it for you this time. Here are instructions on how to do it yourself next time.
#2
First you set all_text to unicodedata.normalize("NFC", all_text).
Then, immediately after, you overwrite it with whatever sent_tokenize(all_text) returns.
Is that what you intended?
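To make the point about overwriting concrete, here is a minimal sketch that keeps each stage in its own variable. The names raw_text, normalized_text, and sentences are illustrative, and split_sentences is a hypothetical regex stand-in for sent_tokenize so the snippet runs without NLTK. It suggests that the sentences are cut from the already normalized string, so reusing the name all_text should not by itself undo the normalization.

```python
import re
import unicodedata


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize: split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Decomposed input: "e" + combining acute accent, "i" + combining diaeresis.
raw_text = "Cafe\u0301 society. Nai\u0308ve idea."

# normalize() returns a new string; the original is left untouched.
normalized_text = unicodedata.normalize("NFC", raw_text)

# The sentences are pieces of the normalized string.
sentences = split_sentences(normalized_text)

# Each sentence is still in NFC form after splitting.
still_normalized = all(unicodedata.normalize("NFC", s) == s for s in sentences)
```

Separate names also make it easier to print each stage and see where the output starts to look different.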
#3
Hi, thanks for answering! My intention was to remove all the Unicode characters by normalizing the text, and then split the text into sentences and store them in an array, which is why I used the sent_tokenize method. I don't know how to do that while preserving the normalized text.
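One thing worth noting: NFC normalization composes characters but does not remove non-ASCII ones. If the goal really is ASCII-only text, a common approach is to decompose with NFKD and then drop whatever cannot be encoded as ASCII. The sketch below assumes that goal; to_ascii and sentence_bigrams are illustrative helper names, not part of the original code.

```python
import unicodedata


def to_ascii(text):
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the marks and any other non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def sentence_bigrams(sentence):
    # Pair each word with the word that follows it.
    words = sentence.split()
    return list(zip(words, words[1:]))


ascii_text = to_ascii("Caf\u00e9 soci\u00e9t\u00e9")
bigrams = sentence_bigrams(to_ascii("the caf\u00e9 is open"))
```

Be aware that this is lossy: characters with no ASCII base letter are dropped entirely rather than transliterated.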
#4
Are you still using Python 2.7?
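The Python version matters here because, in Python 2, printing a list or tuple shows the repr of each element, and the repr of a unicode string escapes non-ASCII characters. That is one possible explanation for the characters "reappearing" in the bigrams, assuming the code runs on Python 2. The sketch below runs on Python 3 and uses the built-in ascii() to imitate that escaped display; the bigram value is made up for illustration.

```python
# A made-up bigram containing a non-ASCII character.
bigram = ("caf\u00e9", "bar")

# ascii() escapes non-ASCII characters, similar to how Python 2 displays
# unicode strings inside a printed list or tuple.
escaped_view = ascii(bigram)

# The stored string itself is unchanged; only the display differs.
first_word = bigram[0]

print(escaped_view)
print(first_word)
```

If this is the cause, the data is fine and only the printed representation of the container looks different.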
