Python Forum
clean unicode string to contain only characters from some unicode blocks




clean unicode string to contain only characters from some unicode blocks - gmarcon - Nov-22-2018

Hi,

I have a Unicode string and I need to remove all characters that are not part of the Basic Latin or Latin-1 Supplement Unicode blocks.

The only way I could get it to work is the following:

#0000..007F; Basic Latin
#0080..00FF; Latin-1 Supplement
# range() excludes its end value, so 256 is needed to include U+00FF
allowed_chars = map(lambda x: unichr(x).encode('utf-8'), range(0, 256))
clean_string = ''.join(char.encode('utf-8') for char in unicode(string,'utf-8') if char.encode('utf-8') in allowed_chars)

Is there a better way? (better = clearer to read, more efficient, ...)

Thank you for your precious support and regards,
Giulio


RE: clean unicode string to contain only characters from some unicode blocks - gmarcon - Nov-23-2018

Ok I got it:

clean_string = ''.join(char for char in string.decode('utf-8') if 0 <= ord(char) <= 255).encode('utf-8')

Unicode is a mess in Python 2: https://nedbatchelder.com/text/unipain.html
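
For anyone reading this on Python 3, here is a minimal sketch of the same filter. It assumes the input is already a str (decoded text), so the decode/encode round trip goes away. The function name keep_latin1 is hypothetical and only used for illustration.

```python
def keep_latin1(text):
    """Keep only code points U+0000..U+00FF (Basic Latin + Latin-1 Supplement)."""
    # ord() gives the code point; anything above 0xFF lies outside both blocks.
    return ''.join(ch for ch in text if ord(ch) <= 0xFF)


# U+00E9 (e with acute) is kept, U+20AC (euro sign) is dropped.
print(keep_latin1("caf\u00e9 \u20ac5"))  # prints: café 5
```

If the data arrives as UTF-8 bytes, it would need a .decode('utf-8') first and an .encode('utf-8') afterwards, as in the Python 2 one-liner above.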


RE: clean unicode string to contain only characters from some unicode blocks - Gribouillis - Nov-23-2018

I think you could use the re module for better performance:
import re
# match runs of characters outside U+0000..U+00FF
regex = re.compile(ur"[^\x00-\xff]+")
clean_string = regex.sub(u"", string.decode('utf8')).encode('utf8')
Why use Python 2?
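
As a rough sketch of what the regex approach might look like on Python 3 (assuming the input is a str; the names NON_LATIN1 and strip_non_latin1 are made up for this example), the ur prefix is no longer valid there and a plain raw string is enough:

```python
import re

# Matches runs of characters outside U+0000..U+00FF.
# The \x00 and \xff escapes are interpreted by the re module itself.
NON_LATIN1 = re.compile(r"[^\x00-\xff]+")


def strip_non_latin1(text):
    """Remove every character outside Basic Latin and Latin-1 Supplement."""
    return NON_LATIN1.sub("", text)


# U+00EF (i with diaeresis) is kept, U+2603 (snowman) is dropped.
print(strip_non_latin1("na\u00efve\u2603"))  # prints: naïve
```

Compiling the pattern once at module level avoids recompiling it on every call. Whether it actually beats the generator version would have to be measured on the real data.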