Python Forum
i think i need to use my own UTF-8 decoder code
#1
i think i need to use my own UTF-8 decoder code:
Output:
lt2a/phil /home/phil 126> py tokenize_stdin.py <sfc.py|cut -c1-132|lineup >/dev/null
Traceback (most recent call last):
  File "/usr/host/bin/lineup", line 8, in <module>
    for arg in argv if argv else stdin:
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1895: invalid continuation byte
Traceback (most recent call last):
  File "tokenize_stdin.py", line 6, in <module>
    print(repr(x))
BrokenPipeError: [Errno 32] Broken pipe
lt2a/phil /home/phil 127> py tokenize_stdin.py <sfc.py|lineup >/dev/null
lt2a/phil /home/phil 128>
i suspect that cut (in the first command pipeline) sliced through the middle of a multi-byte UTF-8 character, leaving a dangling lead byte that then got decoded against the newline at the end of the shortened line. the source file (sfc.py) is pure ASCII, so i am wondering what tokenize.tokenize() put into its output that is non-ASCII enough to need multi-byte UTF-8.
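one way to narrow that down would be a one-off scan of the token stream for anything non-ASCII (an untested sketch; it assumes sfc.py is in the current directory):
import tokenize

# report any token whose repr() contains non-ASCII characters, so the
# source of the multi-byte UTF-8 can be pinned down before cut sees it
with open('sfc.py', 'rb') as f:
    for tok in tokenize.tokenize(f.readline):
        r = repr(tok)
        if any(ord(ch) > 127 for ch in r):
            print(tok.start, ascii(r))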

what really annoys me is that an exception gets raised for this in a way i can't recover from (i'd rather just ignore the character). UTF-8 in general makes things difficult. but at least i have already made my own UTF-8 decoder. now i just need to add some detection of this kind of error and look for the possible causes beyond a bad encoder (those rarely make it into regular use). for example, if the decode comes up with bad Unicode, check for things like a newline byte sitting where a continuation byte should be, which would indicate a bad line cut, and just remove the unfinished UTF-8 sequence.
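to be fair, bytes.decode() can be told to substitute or drop bad bytes instead of raising; the trouble in lineup is that iterating stdin decodes with the strict default before i get a say. a minimal sketch of what the standard error handlers do:
# a line that was cut in the middle of a 2-byte sequence (as cut -c can do)
data = b'abc \xc3'
print(data.decode('utf-8', errors='replace'))  # prints 'abc ' plus U+FFFD
print(data.decode('utf-8', errors='ignore'))   # prints 'abc ' with the byte dropped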

people here like to see code, so ...
tokenize_stdin.py:
import tokenize

n = 0
# fd 0 is stdin; tokenize.tokenize() wants a binary readline callable
with open(0, 'rb') as f:
    for x in tokenize.tokenize(f.readline):
        print(repr(x))
        n += 1
print(f'ALL DONE with {n} tokens')
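for the record, the BrokenPipeError in the output above is just this script noticing that lineup had already died. a sketch of the usual way to quiet that (the try/except is my addition, not in the script as posted):
import os
import sys
import tokenize

n = 0
try:
    with open(0, 'rb') as f:
        for x in tokenize.tokenize(f.readline):
            print(repr(x))
            n += 1
    print(f'ALL DONE with {n} tokens')
except BrokenPipeError:
    # the reader (lineup) exited early; point stdout at /dev/null so
    # the interpreter does not complain again while flushing at exit
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, sys.stdout.fileno())
    sys.exit(1)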
lineup.py:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Line up columns of input text."""
from sys import argv, stdin
argv.pop(0)
size = []  # max width seen so far, indexed by column number
rows = []  # split lines, indexed by row number
for arg in argv if argv else stdin:
    # 2 loops in case 1 argument has 2 or more lines
    for line in arg.splitlines():
        cols = line.split()
        rows.append(cols)
        x = len(cols)
        y = len(size)
        if y < x:
            # grow the width list to cover new columns (minimum width 1)
            size[:] = size + (x - y) * [1]
        for n in range(x):
            size[n] = max(size[n], len(cols[n]))
for row in rows:
    new = []
    n = 0
    for col in row:
        # right-justify numbers, left-justify everything else
        if col.isdecimal():
            new.append(col.rjust(size[n]))
        else:
            new.append(col.ljust(size[n]))
        n += 1
    print(' '.join(new).rstrip())
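and if lineup should shrug off a bad byte instead of dying, one option (a sketch, untested against the script above) is to rewrap stdin with a lenient error handler and use that object in place of the imported stdin:
import io
import sys

# decode stdin with U+FFFD substitution instead of the strict default;
# the loop 'for arg in argv if argv else stdin:' can use this unchanged
stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8', errors='replace')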
#2
Skaperen Wrote: so i am wondering what tokenize.tokenize() put into its output that is non-ASCII enough to need multi-byte UTF-8.
I think I would first investigate this point thoroughly to determine exactly what happened. This is a case where the error can be fully exposed and understood, so why not do that?
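For example, capture the output of the cut stage and scan it for non-ASCII bytes (a hypothetical one-off check; dump.bin is a name I made up):
# first:  py tokenize_stdin.py <sfc.py | cut -c1-132 > dump.bin
with open('dump.bin', 'rb') as f:
    data = f.read()
for i, b in enumerate(data):
    if b >= 0x80:
        # show the offending byte with some context before it
        print(i, hex(b), data[max(0, i - 30):i + 5])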
#3
i wish i had time to investigate everything. i still have Ubuntu "bugs" (things i probably just don't have configured right) that i've lived with for the past few years because i just don't have the time. maybe i'll get another one fixed this month.

now that i am retired i seem to have even less spare time.

