Python Forum
clean unicode string to contain only characters from some unicode blocks
#1
Hi,

I have a unicode string and I need to remove all characters that are not part of the Basic Latin and Latin-1 Supplement Unicode blocks.

The only way I could get it to work is the following:

# 0000..007F; Basic Latin
# 0080..00FF; Latin-1 Supplement
# range(0, 256) so that U+00FF itself is included (Python 2 code)
allowed_chars = map(lambda x: unichr(x).encode('utf-8'), range(0, 256))
clean_string = ''.join(char.encode('utf-8') for char in unicode(string, 'utf-8') if char.encode('utf-8') in allowed_chars)

Is there a better way? (better = clearer to read, more efficient, ...)

Thank you for your precious support and regards,
Giulio
#2
Ok I got it:

clean_string = ''.join(char for char in string.decode('utf-8') if 0 <= ord(char) <= 255).encode('utf-8')

Unicode is a mess in Python 2: https://nedbatchelder.com/text/unipain.html
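
For anyone reading this on Python 3, here is a minimal sketch of the same filter. It assumes the input arrives as UTF-8 bytes, as in the snippets above; the function names keep_latin1 and keep_latin1_via_codec are only illustrative and not from the thread. The second function relies on the fact that the latin-1 codec covers exactly the code points U+0000..U+00FF, so encoding with errors='ignore' drops everything else.

```python
def keep_latin1(text):
    """Keep only code points U+0000..U+00FF (Basic Latin + Latin-1 Supplement)."""
    return ''.join(ch for ch in text if ord(ch) <= 0xFF)


def keep_latin1_via_codec(text):
    """Same result via the latin-1 codec, which maps exactly U+0000..U+00FF."""
    return text.encode('latin-1', errors='ignore').decode('latin-1')


# Hypothetical input: UTF-8 bytes containing a euro sign and a CJK character.
raw = 'caf\u00e9 \u20ac 10 \u4e2d'.encode('utf-8')
text = raw.decode('utf-8')          # decode once, at the boundary
clean = keep_latin1(text)           # str with only Latin-1 range characters
clean_bytes = clean.encode('utf-8') # encode again only if bytes are needed
```

In Python 3 a str is already a sequence of code points, so the decode and encode steps are only needed when the data really is bytes.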
#3
I think you could use the re module for better performance:
import re
regex = re.compile(ur"[^\x00-\xff]+")
clean_string = regex.sub(u"", string.decode('utf8')).encode('utf8')
Why use Python 2?
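
As a rough Python 3 counterpart of the regex approach, something like the sketch below should work. The names NON_LATIN1 and strip_non_latin1 are made up for illustration, and the input is assumed to be a str that has already been decoded. No performance claim is made here; it would need to be timed against the comprehension on real data.

```python
import re

# Matches runs of characters outside U+0000..U+00FF.
NON_LATIN1 = re.compile(r'[^\x00-\xff]+')


def strip_non_latin1(text):
    """Remove every character outside Basic Latin and Latin-1 Supplement."""
    return NON_LATIN1.sub('', text)


# Hypothetical sample: the snowman (U+2603) is outside the allowed range.
sample = 'na\u00efve \u2603 snow'
result = strip_non_latin1(sample)
```

Compiling the pattern once at module level avoids repeating that work when the function is called on many strings.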