i think i need to use my own UTF-8 decoder code
i think i need to use my own UTF-8 decoder code:
Output:
lt2a/phil /home/phil 126> py tokenize_stdin.py <sfc.py|cut -c1-132|lineup >/dev/null
Traceback (most recent call last):
  File "/usr/host/bin/lineup", line 8, in <module>
    for arg in argv if argv else stdin:
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1895: invalid continuation byte
Traceback (most recent call last):
  File "tokenize_stdin.py", line 6, in <module>
    print(repr(x))
BrokenPipeError: [Errno 32] Broken pipe
lt2a/phil /home/phil 127> py tokenize_stdin.py <sfc.py|lineup >/dev/null
lt2a/phil /home/phil 128>
i suspect that cut (in the first command pipeline) sliced through the middle of a multi-byte UTF-8 character, so the leftover lead byte ended up being decoded against the newline byte at the end of the shortened line. the source file (sfc.py) is pure ASCII, so i am wondering what tokenize.tokenize() put in there that is non-ASCII enough to need multi-byte UTF-8.
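here is a tiny reproduction of that suspicion (the 'é' is only a stand-in; i don't yet know what byte sequence tokenize actually emitted). slicing a line's bytes between a UTF-8 lead byte and its continuation byte, then tacking the newline back on, gives the exact same complaint:
line = 'name = "café"\n'.encode('utf-8')   # b'name = "caf\xc3\xa9"\n'
cut = line[:12] + b'\n'                     # slice lands between 0xc3 and 0xa9, like cut -c can
cut.decode('utf-8')                         # UnicodeDecodeError: can't decode byte 0xc3 ... invalid continuation byte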

what really annoys me is that an exception gets raised for this in a way i can't recover from (by just ignoring the character). UTF-8 does generally make things difficult. but at least i have already made my own UTF-8 decoder. now i just need to add some detection of this kind of error and look for the likely things that messed the data up, beyond just a bad encoder (those rarely make it into regular use). for example, if the decode comes up with bad Unicode, check for things like a newline byte right after the bad sequence, which would point to a bad line cut, and just remove the unfinished UTF-8.
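a rough sketch of that "remove the unfinished UTF-8" idea (not my actual decoder, just the shape of the check, trimming whatever incomplete sequence a bad line cut left dangling at the end of a chunk):
def trim_partial_utf8(chunk):
    # drop an incomplete multi-byte sequence left dangling at the end of chunk,
    # e.g. after cut -c sliced a line in the middle of a character
    for back in range(1, 5):                 # a UTF-8 sequence is at most 4 bytes
        if back > len(chunk):
            break
        b = chunk[-back]
        if b < 0x80:                         # plain ASCII byte, nothing dangling
            break
        if b >= 0xc0:                        # found the lead byte of the last sequence
            need = 2 if b < 0xe0 else 3 if b < 0xf0 else 4
            if back < need:                  # fewer continuation bytes than the lead promises
                return chunk[:-back]         # trim the unfinished sequence
            break                            # sequence is complete; leave it alone
    return chunk
it expects the line without its trailing newline (strip it, trim, then put the newline back), and applied per line it would turn the cut-damaged line back into decodable UTF-8, minus the one character that cut chopped.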

people here like to see code, so ...
tokenize_stdin.py:
import tokenize

n = 0
with open(0, 'rb') as f:    # fd 0 is stdin, opened in binary as tokenize requires
    for x in tokenize.tokenize(f.readline):
        print(repr(x))
        n += 1
print(f'ALL DONE with {n} tokens')
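by the way, the second traceback in that output (BrokenPipeError from print) is just fallout: once lineup died, the pipe closed and the next print() in tokenize_stdin.py had nowhere to write. if that noise bothers anyone, a sketch of the same script with a guard around it (not something it needs in order to work) could be:
import os
import sys
import tokenize

n = 0
try:
    with open(0, 'rb') as f:                 # fd 0 is stdin, binary as tokenize requires
        for x in tokenize.tokenize(f.readline):
            print(repr(x))
            n += 1
    print(f'ALL DONE with {n} tokens')
except BrokenPipeError:
    # downstream (lineup here) already exited; point stdout at /dev/null so the
    # interpreter's final flush doesn't raise a second error, then bail out
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, sys.stdout.fileno())
    sys.exit(1)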
lineup.py:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Line up columns of input text."""
from sys import argv, stdin
argv.pop(0)   # drop the script name so only real arguments remain
size = [] # indexed by col number
rows = [] # indexed by row number
for arg in argv if argv else stdin:
    # 2 loops in case 1 argument has 2 or more lines
    for line in arg.splitlines():
        cols = line.split()
        rows.append(cols)
        x = len(cols)
        y = len(size)
        if y < x:
            size[:] = size + (x-y)*[1]   # extend size to cover newly seen columns
        for n in range(x):
            size[n] = max(size[n],len(cols[n]))
for row in rows:
    new = []
    n = 0
    for col in row:
        if col.isdecimal():
            new.append(col.rjust(size[n]))
        else:
            new.append(col.ljust(size[n]))
        n += 1
    print(' '.join(new).rstrip())
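and one way to sidestep the exception i complained about above (without my own decoder) would be to re-wrap stdin in lineup.py so bad sequences become replacement characters instead of raising; a minimal sketch, not what the script does now:
import io
import sys

def tolerant_stdin():
    # read the raw byte stream and decode it leniently: malformed UTF-8 turns
    # into U+FFFD replacement characters instead of a UnicodeDecodeError
    return io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8', errors='replace')
the "for arg in" line would then read "for arg in argv if argv else tolerant_stdin():" and the cut-damaged line would just carry a replacement character where the chopped character was.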
Skaperen Wrote: so i am wondering what tokenize.tokenize() put in there that is non-ASCII enough to need multi-byte UTF-8.
I think I would first investigate this point thoroughly to determine exactly what happened. This is a case where the error can be fully exposed and understood, so why not do that?
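For example (just a sketch; tokens.txt here is a hypothetical capture made with: py tokenize_stdin.py <sfc.py >tokens.txt), a few lines would show exactly which bytes of the tokenizer output are non-ASCII and what surrounds them:
# scan a captured copy of the tokenizer output for non-ASCII bytes
with open('tokens.txt', 'rb') as f:
    data = f.read()
for i, b in enumerate(data):          # iterating bytes yields integer byte values
    if b >= 0x80:
        print(i, hex(b), data[max(0, i - 20):i + 20])
That narrows the search to the exact token repr that contains the offending bytes.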
i wish i had time to investigate everything. i still have Ubuntu "bugs" (things i probably just don't have configured right) that i've lived with for the past few years because i just don't have the time. maybe i'll get another one fixed this month.

now that i am retired i seem to have even less spare time.