Python Forum

How to Remove Non-ASCII Characters But Leave Line Breaks In Place?
I have a function in a Python script that serves to remove non-ASCII characters from strings before these strings are ultimately saved to an Oracle database.

		
import re

# Remove control characters (code points 0-31) and everything from 127 up.
sCleanedString = re.sub(r'[^\x20-\x7E]', '', sStringToClean)
When I pass in a large string containing the full content of an email message, the function also strips out the line break characters, leaving me with a cleaned string that is jumbled together with no line breaks. For this special case, I'd like to clean the string but keep the line break characters.

Any suggestions on how to modify the above Python code to do this?

Thanks!
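For anyone following along, here is a small sketch of why the line breaks disappear. The sample string is made up for illustration and stands in for an email body. Line feed and carriage return have code points 10 and 13, which lie below `\x20`, so the negated character class matches them as well.

```python
import re

# Hypothetical sample standing in for an email body (not from the original post).
sample = "Hello,\r\nthis is line two.\nBye\u00e9"

# The pattern from the question: keep only printable ASCII (0x20-0x7E).
cleaned = re.sub(r'[^\x20-\x7E]', '', sample)

# The line breaks are removed together with the non-ASCII character.
print(repr(cleaned))

# Both line break code points are below 0x20.
print(ord("\n"), ord("\r"))
```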
Don't use regex for this.

def strip_ascii(text):
    return "".join(
        char for char
        in text
        if 31 < ord(char) < 127
    )
# \x0A (line feed) is added to the allowed set; every other disallowed character becomes '*'.
sCleanedString = re.sub(r'[^\x0A\x20-\x7E]', '*', typestr)
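If the regex route is preferred, a possible variant is sketched below. It assumes that carriage returns should survive as well (emails often use `\r\n`) and that unwanted characters should be deleted rather than replaced with `*`. The name `clean_keep_linebreaks` is hypothetical and not from the thread.

```python
import re

# Keep LF (\x0A), CR (\x0D), and printable ASCII (0x20-0x7E); match everything else.
_NOT_ALLOWED = re.compile(r'[^\x0A\x0D\x20-\x7E]')


def clean_keep_linebreaks(text):
    """Hypothetical helper: delete disallowed characters but preserve line breaks."""
    return _NOT_ALLOWED.sub('', text)


# Usage: the tab and the 'ä' are dropped, the line breaks stay.
print(repr(clean_keep_linebreaks("a\r\nb\tc\u00e4\n")))
```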
Many thanks for the couple of recommendations on how to address the non-ASCII character situation without removing line break characters!
def strip_ascii(text):
    return "".join(
        char for char
        in text
        if 31 < ord(char) < 127 or char in "\n\r"
    )
Output:
In [21]: print(strip_ascii("""
    ...:
    ...: Das ist ein Test
    ...: ^^^^
    ...: ================
    ...: ****************
    ...: """))


Das ist ein Test
^^^^
================
****************
This condition can be optimized:
31 < ord(char) < 127 or char in "\n\r"
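One way the condition might be simplified, as a sketch only: build the set of allowed characters once and test membership directly, so `ord()` is not called for every character. The names `ALLOWED` and `strip_keep_linebreaks` are made up for this example, and whether it is actually faster would need to be measured on real data.

```python
# Printable ASCII (code points 32-126) plus the two line break characters.
ALLOWED = frozenset(chr(code) for code in range(32, 127)) | {"\n", "\r"}


def strip_keep_linebreaks(text):
    """Hypothetical helper: keep only characters found in ALLOWED."""
    return "".join(char for char in text if char in ALLOWED)


# Usage: umlauts and other non-ASCII characters are dropped, line breaks are kept.
print(strip_keep_linebreaks("\u00e4\u00e4Das ist ein Test\u00f6\u00f6\n^^^^\u00df\u00b0\n"))
```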
Edit: Second example
Output:
In [24]: print(strip_ascii("""
    ...:
    ...: ääääääääöööööDas ist ein Testööüüüüüüüü
    ...: ^^^^ßßßßß°°°°°°°
    ...: =°=°=°=°=°=°=°=°=°=°=°=°=°=°=°=
    ...: ****************???
    ...: """))


Das ist ein Test
^^^^
================
****************