Python Forum

Unicode character search
I rarely have to worry about Unicode, especially at the code point (character) level.
I'm finding out that I really don't know how to handle it, and I can't find much help
by searching Google (perhaps because I don't know how to formulate my question).

I need to replace certain UTF-8 code points in my file because Microsoft does not include them
in its UTF-8 definition. The self.ms_no_points dictionary causes the error.


# from Kebap: May I suggest A I I D Y instead of Á Í Ï Ð Ý


class Utf8stuff:
    def __init__(self, infile_name=None, outfile_name=None):
        self.infile_name = infile_name
        self.outfile_name = outfile_name
        self.ms_no_points = {'\u+081': 'A', '\u+08d': 'I', '\u+08f': 'I', '\u+090':'D', '\u+09d': 'Y'}

        with open(self.infile_name) as f:
            self.inbuff = f.readlines()
        self.process_input()

    def process_input(self):
        linecount = 1
        for line in self.inbuff:
            for key, value in self.ms_no_points.items():
                if key in line:
                    pos = line.index(key)
                    print('found {} at pos: {} in line {}'.format(key, pos, linecount))
            linecount += 1

if __name__ == '__main__':
    ifile = 'er.sql'
    ofile = 'erNew.sql'
    Utf8stuff(infile_name=ifile, outfile_name=ofile)
Traceback:
Error:
  File " .../myconv.py", line 9
    self.ms_no_points = {'\u+081': 'A', '\u+08d': 'I', '\u+08f': 'I', '\u+090':'D', '\u+09d': 'Y'}
                                 ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
And so how's it done?
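\u starts a Unicode escape in Python 3 string literals and must be followed by exactly four hex digits (for example '\u0081'), so '\u+081' is read as a truncated escape.
If you want the literal characters instead, use a forward slash (/) or a raw string.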
>>> s = '\u'
Traceback (most recent call last):
 File "python", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape 

>>> s = '/u'
>>> s
'/u'

>>> s = r'\u'
>>> s
'\\u'  
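If the goal is to match the actual code points rather than the literal text, a minimal sketch could look like the one below. It assumes the keys in the question were meant as U+0081, U+008D, U+008F, U+0090 and U+009D, written as valid four-hex-digit escapes. The names MS_NO_POINTS, TABLE and replace_points are made up for illustration, and whether these are the right characters for the real file is not something this thread confirms.

```python
# Sketch only: the keys mirror the question's dictionary, rewritten as valid
# \uXXXX escapes (exactly four hex digits, no '+'). That these are the code
# points actually present in the file is an assumption.
MS_NO_POINTS = {
    '\u0081': 'A',
    '\u008d': 'I',
    '\u008f': 'I',
    '\u0090': 'D',
    '\u009d': 'Y',
}

# str.maketrans accepts a dict whose keys are single characters and builds
# a table that str.translate can apply in one pass.
TABLE = str.maketrans(MS_NO_POINTS)


def replace_points(text):
    """Return text with every key of MS_NO_POINTS replaced by its value."""
    return text.translate(TABLE)


if __name__ == '__main__':
    sample = 'M\u0081laga'
    print(replace_points(sample))
```

Using str.translate avoids looping over the dictionary for every line, and each escape here is a single character, so len('\u0081') is 1.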
Thanks snippsat