Aug-28-2021, 01:34 AM
i think i need to use my own UTF-8 decoder code:
Output:
lt2a/phil /home/phil 126> py tokenize_stdin.py <sfc.py|cut -c1-132|lineup >/dev/null
Traceback (most recent call last):
File "/usr/host/bin/lineup", line 8, in <module>
for arg in argv if argv else stdin:
File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1895: invalid continuation byte
Traceback (most recent call last):
File "tokenize_stdin.py", line 6, in <module>
print(repr(x))
BrokenPipeError: [Errno 32] Broken pipe
lt2a/phil /home/phil 127> py tokenize_stdin.py <sfc.py|lineup >/dev/null
lt2a/phil /home/phil 128>
i suspect that cut (in the first command pipeline) sliced through the middle of some multi-byte UTF-8 character, so the truncated sequence ended up being decoded right next to the newline byte at the end of the shortened line. the source file (sfc.py) is only ASCII, so i am wondering what tokenize.tokenize() put in there that is non-ASCII enough to produce code points that encode as multi-byte UTF-8.

what really annoys me is that an exception has to be raised for this in a way that i can't recover from (ignore the character). UTF-8 does, in general, make things difficult. but at least i have already made my own UTF-8 decoder. now i just need to add some detection of this kind of error and look for possible causes that are not just a bad encoder (it is rare for a bad encoder to get put into regular use). for example, if the decode comes up with bad Unicode, check for things like a newline byte that would indicate a bad line cut, and just remove the unfinished UTF-8 sequence.
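something like this minimal sketch is the kind of cleanup i mean; the strip_truncated_utf8 helper and its exact heuristic are just illustrative here, not my actual decoder:

def strip_truncated_utf8(raw: bytes) -> bytes:
    """remove an unfinished multi-byte UTF-8 sequence from the end of raw."""
    i = len(raw)
    # scan back over trailing continuation bytes (10xxxxxx)
    while i > 0 and raw[i - 1] & 0b1100_0000 == 0b1000_0000:
        i -= 1
    if i == 0:
        return raw
    lead = raw[i - 1]
    if lead < 0x80:
        # plain ASCII at the end, nothing was truncated
        return raw
    # how many bytes this lead byte says the sequence should have
    need = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    have = len(raw) - (i - 1)
    # drop the unfinished sequence if fewer bytes arrived than promised
    return raw[:i - 1] if have < need else raw

if __name__ == '__main__':
    cut_line = 'abc\u00e9'.encode('utf-8')[:-1]   # a 2-byte character sliced in half, like cut -c did
    print(strip_truncated_utf8(cut_line).decode('utf-8'))   # prints: abc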
people here like to see code, so ...
tokenize_stdin.py:
import tokenize

n = 0
with open(0,'rb') as f:
    for x in tokenize.tokenize(f.readline):
        print(repr(x))
        n += 1
print(f'ALL DONE with {n} tokens')

lineup.py:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Line up columns of input text."""
from sys import argv,stdin
argv.pop(0)
size = [] # indexed by col number
rows = [] # indexed by row number
for arg in argv if argv else stdin: # 2 loops in case 1 argument has 2 or more lines
    for line in arg.splitlines():
        cols = line.split()
        rows.append(cols)
        x = len(cols)
        y = len(size)
        if y<x:
            size[:] = size+(x-y)*[1]
        for n in range(x):
            size[n] = max(size[n],len(cols[n]))
for row in rows:
    new = []
    n = 0
    for col in row:
        if col.isdecimal():
            new.append(col.rjust(size[n]))
        else:
            new.append(col.ljust(size[n]))
        n += 1
    print(' '.join(new).rstrip())
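for comparison, here is a minimal sketch of how the stdin case in lineup could be made tolerant of that kind of damage by reading raw bytes and decoding each line itself; errors='replace' is just one possible policy (errors='ignore' would silently drop the bad bytes), not necessarily what i will end up doing with my own decoder:

import sys

def tolerant_lines(stream=None):
    """yield decoded lines from a binary stream without raising on bad UTF-8."""
    if stream is None:
        stream = sys.stdin.buffer   # binary view of stdin
    for raw in stream:
        yield raw.decode('utf-8', errors='replace')

if __name__ == '__main__':
    # print each incoming line; a sliced multi-byte character becomes U+FFFD
    for line in tolerant_lines():
        print(repr(line))

the same generator could stand in for stdin in the "for arg in argv if argv else stdin:" loop, since each yielded line still works with splitlines().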