Normalizing scraped text


wuggs

Jan-06-2020, 01:05 PM (This post was last modified: Jan-07-2020, 03:32 AM by Larz60+.)

Hi,
I'm not sure this is the right section for this, but I have a problem when normalizing text. If I print the all_text variable right after normalizing, the text looks normalized. However, when I build bigrams from that text, the Unicode characters reappear in the output.

    
    import unicodedata
    from nltk.tokenize import sent_tokenize  # assuming NLTK's sentence tokenizer

    all_text = unicodedata.normalize("NFC", all_text)
    sentences = sent_tokenize(all_text)
    bigrams = [b for sentence in sentences for b in zip(sentence.split(" ")[:-1], sentence.split(" ")[1:])]
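
One possible explanation, sketched under assumptions: the scraped text may contain non-breaking spaces (U+00A0). NFC leaves those untouched, and `split(" ")` does not split on them, so they stay inside the tokens. Printing a string renders them invisibly, but printing a list of tuples shows each string's `repr`, where the character appears as `\xa0`. The sample string and the `bigrams` helper below are hypothetical and only illustrate the effect; NFKC plus a plain `split()` is one way around it, not necessarily the right fix for the real data.

```python
import unicodedata

# Hypothetical sample standing in for scraped text: it contains a
# non-breaking space (U+00A0) between "café" and "au".
sample = "caf\u00e9\u00a0au lait"

nfc = unicodedata.normalize("NFC", sample)    # keeps U+00A0 as-is
nfkc = unicodedata.normalize("NFKC", sample)  # maps U+00A0 to a plain space


def bigrams(sentence, sep=None):
    """Return adjacent word pairs; sep=None splits on any whitespace."""
    words = sentence.split(sep)
    return list(zip(words, words[1:]))


# Printing the string itself looks fine, because the terminal renders U+00A0
# as a blank. Printing the list shows the repr of each string instead.
print(nfc)
print(bigrams(nfc, " "))   # [('café\xa0au', 'lait')]
print(bigrams(nfkc))       # [('café', 'au'), ('au', 'lait')]
```

If the escapes in the real output are something other than `\xa0`, inspecting `repr(all_text)` right after normalizing should show which characters are actually present.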


