clean unicode string to contain only characters from some unicode blocks

gmarcon · Nov-22-2018, 06:24 PM

Hi,

I have a Unicode string and I need to remove all characters that are not part of the Basic Latin and Latin-1 Supplement Unicode blocks.

The only way I could get it to work is the following:

# 0000..007F; Basic Latin
# 0080..00FF; Latin-1 Supplement
# range() excludes its end point, so 256 is needed to include U+00FF
allowed_chars = map(lambda x: unichr(x).encode('utf-8'), range(0, 256))
clean_string = ''.join(char.encode('utf-8') for char in unicode(string, 'utf-8') if char.encode('utf-8') in allowed_chars)

Is there a better way? (better = clearer to read, more efficient, ...)

Thank you for your precious support and regards,
Giulio

gmarcon · Nov-23-2018, 06:51 PM

Ok I got it:

clean_string = ''.join(char for char in string.decode('utf-8') if 0 <= ord(char) <= 255).encode('utf-8')

Unicode is a mess in Python 2: https://nedbatchelder.com/text/unipain.html
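
For comparison, here is a minimal sketch of the same code-point filter in Python 3, where str is already Unicode and the decode/encode round trip is only needed for byte input. The function names keep_latin1 and keep_latin1_bytes are made up for this illustration and do not appear in the thread.

```python
def keep_latin1(text: str) -> str:
    """Keep only characters in U+0000..U+00FF
    (Basic Latin plus Latin-1 Supplement)."""
    return ''.join(ch for ch in text if ord(ch) <= 0xFF)


def keep_latin1_bytes(data: bytes) -> bytes:
    """Same filter for UTF-8 encoded bytes, mirroring the
    decode -> filter -> encode steps of the Python 2 one-liner."""
    return keep_latin1(data.decode('utf-8')).encode('utf-8')


if __name__ == '__main__':
    # U+00E9 is inside the allowed range, U+65E5 and U+672C are not.
    print(repr(keep_latin1('caf\u00e9 \u65e5\u672c')))
```

The lower bound check (0 <= ord(char)) is dropped here because ord() never returns a negative number.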

Gribouillis · (last modified: Nov-23-2018, 09:19 PM)

I think you could use the re module for better performance

import re
# match runs of characters outside U+0000..U+00FF
regex = re.compile(ur"[^\x00-\xff]+")
clean_string = regex.sub(u"", string.decode('utf8')).encode('utf8')

Why use Python 2?
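
As a rough sketch of what the regex approach could look like in Python 3: the pattern is unchanged, but the ur prefix is gone (it is a syntax error in Python 3) and no decode/encode is needed when the input is already a str. The names NON_LATIN1, strip_non_latin1 and strip_via_codec are hypothetical. The second function is a different technique, not the one from the post: it relies on the latin-1 codec covering exactly U+0000..U+00FF and drops everything else with errors='ignore'.

```python
import re

# Runs of characters outside U+0000..U+00FF.
NON_LATIN1 = re.compile(r'[^\x00-\xff]+')


def strip_non_latin1(text: str) -> str:
    """Remove every character outside the two Latin blocks using re."""
    return NON_LATIN1.sub('', text)


def strip_via_codec(text: str) -> str:
    """Alternative: round-trip through the latin-1 codec,
    silently discarding characters it cannot encode."""
    return text.encode('latin-1', errors='ignore').decode('latin-1')


if __name__ == '__main__':
    sample = 'a\u0100b\u00ffc'
    print(repr(strip_non_latin1(sample)))
    print(repr(strip_via_codec(sample)))
```

Which variant is faster depends on the input, so timing them on representative data (for example with timeit) is worth doing before choosing one.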
