Normalizing scraped text
Not sure if this is the right section for this, but I have a problem with normalizing text. If I print the all_text variable right after normalizing, the text looks normalized. Unfortunately, when I then build bigrams from the text, the unicode characters reappear.
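    import unicodedata
    from nltk.tokenize import sent_tokenize
    all_text = unicodedata.normalize("NFC", all_text)
    all_text = sent_tokenize(all_text)
    # pair each word with the word that follows it, sentence by sentence
    bigrams = [b for sentence in all_text for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]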
Larz60+ wrote Jan-07-2020, 03:32 AM:
Please post all code, output, and errors (in their entirety) between their respective tags. I did it for you this time. Here are instructions on how to do it yourself next time.
First you set all_text to unicodedata.normalize("NFC", all_text).
Then, immediately after, you overwrite it with whatever sent_tokenize(all_text) returns.
Is that what you intended?
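Reusing the name is harmless here, since sent_tokenize receives the already normalized string, but giving each step its own name makes it easier to print and inspect every stage. A minimal sketch, assuming Python 3: split_sentences is a hypothetical regex stand-in for sent_tokenize (so the snippet runs without NLTK data), and the sample text and variable names are made up for illustration.

```python
import re
import unicodedata


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize: split after ., ! or ?
    # followed by whitespace. Far cruder than the real tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up sample: "e" followed by a combining acute accent (U+0301).
raw_text = "Cafe\u0301 culture is old. It is still popular!"

# Each stage keeps its own name, so nothing is overwritten.
normalized_text = unicodedata.normalize("NFC", raw_text)
sentences = split_sentences(normalized_text)

# Pair each word with its successor, sentence by sentence.
bigrams = [
    pair
    for sentence in sentences
    for pair in zip(sentence.split()[:-1], sentence.split()[1:])
]
```

With this layout you can print normalized_text, sentences, and bigrams separately and see at which stage the output stops looking the way you expect.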
Hi, thanks for answering! My intention was to remove all the unicode characters by normalizing the text, and then to split the text into sentences and store them in a list, which is why I used the sent_tokenize method. I don't know how to do that while preserving the normalized text.
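Worth noting that NFC normalization does not remove non-ASCII characters; it only composes sequences such as "e" plus a combining accent into a single code point. If the goal really is plain ASCII, one common approach is to decompose with NFKD and then drop whatever cannot be encoded. A rough sketch, assuming Python 3; to_ascii is a hypothetical helper name, and this is lossy: characters with no ASCII base letter are simply discarded.

```python
import unicodedata


def to_ascii(text):
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # (and any other character that has no ASCII form).
    return decomposed.encode("ascii", "ignore").decode("ascii")


# NFC keeps the accent; it only merges the two code points into one.
composed = unicodedata.normalize("NFC", "e\u0301")

# Made-up sample text for illustration.
cleaned = to_ascii("Caf\u00e9 na\u00efve fa\u00e7ade")
```

Running to_ascii before tokenizing would give sentences and bigrams that contain only ASCII, at the cost of losing the accents.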
Are you still using Python 2.7?
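The version matters because Python 2 prints containers using the repr of each element, so a list of unicode tuples shows escape sequences even though the underlying text is unchanged. That may be what "reappearing" means here, though it is only a guess without the actual output. A small sketch, assuming Python 3, where ascii() gives an escaped view similar to that repr; the sample bigram is made up.

```python
# Made-up bigram containing a non-ASCII character.
bigram = ("Caf\u00e9", "culture")

# ascii() escapes non-ASCII characters, roughly what a printed
# container of unicode strings looked like under Python 2.
escaped_view = ascii(bigram)

# Joining the strings and printing them shows the real characters;
# the data itself was never altered.
readable_view = " ".join(bigram)

print(escaped_view)
print(readable_view)
```

If this is the cause, printing the individual strings rather than the whole list, or moving to Python 3, should show the expected characters.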