Normalizing scraped text

wuggs · (This post was last modified: Jan-07-2020, 03:32 AM by Larz60+.)

Hi,
I'm not sure if this is the right section for this, but I have a problem when normalizing text. If I print the all_text variable after normalizing, the text is normalized. Unfortunately, when I try to create bigrams from the text, the Unicode characters reappear.

import unicodedata
from nltk.tokenize import sent_tokenize

all_text = unicodedata.normalize("NFC", all_text)
all_text = sent_tokenize(all_text)
bigrams = [b for sentence in all_text for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]

**Larz60+** · Jan-06-2020, 01:45 PM

First you set all_text to unicodedata.normalize("NFC", all_text).
Then, immediately after, you overwrite that with whatever sent_tokenize(all_text) returns.
Is that what you intended?

wuggs · Jan-06-2020, 02:21 PM

Hi, thanks for answering! My intention was to remove all the Unicode characters by normalizing the text, and then to split the text into sentences and store them in an array, which is why I used the sent_tokenize method. I don't know how to achieve that while preserving the normalized text.
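
For illustration, here is a minimal Python 3 sketch of the pipeline described above: normalize, split into sentences, then build bigrams per sentence. `split_sentences` and `sentence_bigrams` are hypothetical names, and the regex splitter is only a stand-in for `sent_tokenize` so the example runs without extra packages; it is not how NLTK tokenizes.

```python
import re
import unicodedata


def split_sentences(text):
    # Hypothetical stand-in for sent_tokenize: split on whitespace that
    # follows ".", "!" or "?". A real tokenizer handles many more cases.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_bigrams(text):
    # Each step receives the output of the previous one, so the
    # normalized form is carried through to the bigrams.
    normalized = unicodedata.normalize("NFC", text)
    sentences = split_sentences(normalized)
    bigrams = []
    for sentence in sentences:
        words = sentence.split()
        bigrams.extend(zip(words, words[1:]))
    return normalized, sentences, bigrams


# "e" followed by a combining acute accent (two code points);
# NFC composes them into the single code point U+00E9.
raw = "Cafe\u0301 au lait. Very good!"
normalized, sentences, bigrams = sentence_bigrams(raw)
print(bigrams)
```

The normalized form survives both later steps, even if the same variable name is reused along the way. Note that NFC only composes characters into their canonical form; it does not strip non-ASCII characters such as é.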

**Larz60+** · Jan-07-2020, 03:32 AM

Are you still using Python 2.7?
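
One possible reason the version matters (an assumption; the thread does not confirm it): in Python 2, printing a list or tuple shows the repr of each element, and the repr of a unicode string escapes non-ASCII characters, so a normalized é appears as \xe9 inside printed bigrams even though the string itself is unchanged. The sketch below reproduces that view in Python 3 with the built-in `ascii()`; the variable names are made up for the example.

```python
# A bigram list containing one non-ASCII character (é, U+00E9).
pairs = [("caf\u00e9", "au")]

# ascii() escapes non-ASCII characters in the repr, similar to what
# Python 2 displays when a container of unicode strings is printed.
escaped_view = ascii(pairs)

# Printing the strings themselves shows the actual characters.
plain_view = " ".join(pairs[0])

print(escaped_view)
print(plain_view)
```

If this is the cause, the data is fine and only the display differs; printing each word rather than the whole list would show the normalized characters.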

