Python Forum
clean unicode string to contain only characters from some unicode blocks
#1
Hi,

I have a Unicode string and I need to remove all characters that are not part of the Basic Latin and Latin-1 Supplement Unicode blocks.

The only way I could get it to work is the following:

# 0000..007F; Basic Latin
# 0080..00FF; Latin-1 Supplement
# range(0, 256) so that U+00FF itself is included (range(0, 255) stops at U+00FE)
allowed_chars = map(lambda x: unichr(x).encode('utf-8'), range(0, 256))
clean_string = ''.join(char.encode('utf-8') for char in unicode(string, 'utf-8') if char.encode('utf-8') in allowed_chars)

Is there a better way? (better = code clearer to read, code efficient, ...)

Thank you for your precious support and regards,
Giulio
#2
Ok I got it:

clean_string = ''.join(char for char in string.decode('utf-8') if 0 <= ord(char) <= 255).encode('utf-8')

Unicode is a mess in Python 2: https://nedbatchelder.com/text/unipain.html
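For readers on Python 3, where str is already Unicode text, the same idea needs no decode/encode round trip. This is only a sketch under that assumption; the function names keep_latin1 and keep_latin1_codec and the sample string are made up for illustration. The second function relies on the fact that the latin-1 codec covers exactly U+0000..U+00FF, so encoding with errors="ignore" drops everything else.

```python
def keep_latin1(text: str) -> str:
    """Keep only code points U+0000..U+00FF (Basic Latin + Latin-1 Supplement)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFF)


def keep_latin1_codec(text: str) -> str:
    """Same result via the latin-1 codec, which maps exactly U+0000..U+00FF."""
    return text.encode("latin-1", errors="ignore").decode("latin-1")


# Hypothetical sample: the euro sign and the CJK characters fall outside the two blocks.
sample = "caf\u00e9 \u20ac5 \u4e2d\u6587 na\u00efve"
print(keep_latin1(sample))
print(keep_latin1_codec(sample) == keep_latin1(sample))
```

If the input arrives as UTF-8 bytes, decode it first with data.decode("utf-8") and encode the cleaned result again if bytes are needed.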
#3
I think you could use the re module for better performance:
import re
# match runs of characters outside U+0000..U+00FF
regex = re.compile(ur"[^\x00-\xff]+")
clean_string = regex.sub(u"", string.decode('utf8')).encode('utf8')
Why use Python 2?
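As a rough Python 3 counterpart of the regex approach (a sketch, assuming the input is already a str; NON_LATIN1 and strip_non_latin1 are illustrative names, not from the thread), the ur"" prefix is gone and no decode/encode step is needed:

```python
import re

# Runs of characters outside U+0000..U+00FF; compiled once and reused.
NON_LATIN1 = re.compile(r"[^\x00-\xff]+")


def strip_non_latin1(text: str) -> str:
    """Remove every character that is not in Basic Latin or Latin-1 Supplement."""
    return NON_LATIN1.sub("", text)


print(strip_non_latin1("\u4e2d\u6587abc\u00ff\u0100"))
```

Whether this is actually faster than a generator expression depends on the input, so it is worth timing both on real data before choosing.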