lookup tables
i am using lookup tables to define the logic for converting a sequence of octets in UTF-8 form to a Unicode code point. there are actually two tables. both the index and the value are in range(256), so does it make sense to use a bytearray? when doing a lookup, the index is always an int, and the looked-up value is used like an int, which fits the model of bytearray very well. i read somewhere that a bytearray is stored as contiguous bytes in memory, which should make the lookup indexing very fast. is this true?
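here is a quick check i can do in Python 3, at least for part of it: indexing a bytearray does give an int, and a memoryview over it reports a contiguous buffer:

table = bytearray(256)
table[0x41] = 2
print(type(table[0x41]))             # <class 'int'>
print(memoryview(table).contiguous)  # True -- one contiguous block of bytes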

here is my logic to build the tables:
num=bytearray([0])*256
bit=bytearray([0])*256
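# ctl rows: (first octet, last octet + 1, sequence length in octets
# (0 = continuation octet), payload bit mask)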
ctl=((0,128,1,255),
     (128,192,0,0),
     (192,224,2,31),
     (224,240,3,15),
     (240,248,4,7),
     (248,252,5,3),
     (252,254,6,1),
     (254,255,7,0),
     (255,256,8,0),
    )
for a,b,c,d in ctl:
    for o in range(a,b):
        num[o]=c
        bit[o]=o&d
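roughly, the tables get used like this (a simplified sketch, not the actual decoder; it assumes the input yields ints, e.g. a bytearray, which works the same in 2.7 and 3.x):

def utf8_to_codepoints(octets):
    # num[o] is the length of the sequence a lead octet o starts
    # (0 means o is a continuation octet); bit[o] is its payload bits
    cps = []
    it = iter(octets)
    for o in it:
        n = num[o]
        if n == 0:
            raise ValueError('unexpected continuation octet')
        cp = bit[o]
        for _ in range(n - 1):
            c = next(it)
            if num[c] != 0:
                raise ValueError('expected continuation octet')
            cp = (cp << 6) | (c & 0x3f)
        cps.append(cp)
    return cps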
This entry in the documentation seems to guarantee the contiguous storage you're asking about. However, if you're creating read-only tables, why not use the bytes type directly?
num = bytes(c for a, b, c, d in ctl for o in range(a, b))
bit = bytes(o & d for a, b, c, d in ctl for o in range(a, b))
i don't require contiguous bytes. but since all the items are < 256, bytes are usable and they can be contiguous. a contiguous lookup should be faster than, for example, a list of ints, which is what earlier prototypes used. the bytes type does look good in Python3. but in Python2, bytes == str. in some places i split the code based on Python2 vs. Python3, while i try to make most of the code work in both. my goal for my UTF-8 code and my Escape Sequence code is to make everything work in 2.7 and 3.x as much as i can.
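one split that seems to keep int-valued lookups everywhere (a sketch, assuming the ctl table from the first post): build the tables as bytearray and only freeze them to bytes on 3.x, since in 2.7 bytes is str and indexing a str gives 1-char strings instead of ints:

import sys

num = bytearray(256)
bit = bytearray(256)
for a, b, c, d in ctl:
    for o in range(a, b):
        num[o] = c
        bit[o] = o & d
if sys.version_info[0] >= 3:
    # only on 3.x: bytes gives a read-only table and still indexes to int
    num, bit = bytes(num), bytes(bit)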

one thing i am putting some thought into is whether someone might want the UTF-8 results back in a byte type even though they had to give the Unicode data in a type that supports the full range of code points (a list of ints in both versions, str in Python3, unicode in Python2). originally i was going to return the same type as given. going from UTF-8 to Unicode is the easier direction: even if the UTF-8 is given as bytes, the Unicode result probably can't be, so i need to return something bigger. going the other way has a different issue: if the Unicode is given as some large type (which the caller usually must do), is that the type they want the UTF-8 in, or do they want it in a byte type? so i'm thinking of adding support for a returntype= option to let the caller specify, roughly as sketched below.
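a rough, hypothetical sketch of that option (the function name and details are only illustrative; it takes code points as ints and omits checks for surrogates and out-of-range values):

def unicode_to_utf8(codepoints, returntype=bytes):
    # encode an iterable of code points to UTF-8 octets and hand them
    # back in whatever container the caller names
    out = bytearray()
    for cp in codepoints:
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.extend((0xc0 | cp >> 6, 0x80 | cp & 0x3f))
        elif cp < 0x10000:
            out.extend((0xe0 | cp >> 12,
                        0x80 | cp >> 6 & 0x3f,
                        0x80 | cp & 0x3f))
        else:
            out.extend((0xf0 | cp >> 18,
                        0x80 | cp >> 12 & 0x3f,
                        0x80 | cp >> 6 & 0x3f,
                        0x80 | cp & 0x3f))
    return returntype(out)      # bytes, bytearray, or list of ints

for example, unicode_to_utf8([0x20ac], returntype=list) would give [0xe2, 0x82, 0xac], while returntype=bytes gives the byte form.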
oh, now i see the intent of your suggestion. the reason i have the code this way is to ensure that even if i modify the ctl table, the generated tables will still be 256 in length and each value will land in the right place. they are initialized to the right length with the default values that should remain if scanning ctl happens not to store anything at some location, or if the order of the rows is changed. for example, a reordered and incomplete ctl like this:

ctl=(
     (192,224,2,31),
     (224,240,3,15),
     (240,248,4,7),
     (252,254,6,1),
     (254,255,7,0),
     (255,256,8,0),
     (0,128,1,255),
    )
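with that reordered, incomplete ctl, the preinitialized build still comes out 256 long with the defaults left in the uncovered spots, while the generator-expression bytes() build does not (a quick Python 3 check):

num = bytearray(256)               # every octet starts with the default 0
for a, b, c, d in ctl:
    for o in range(a, b):
        num[o] = c
print(len(num), num[130], num[250])    # 256 0 0 -- gaps keep the default

num2 = bytes(c for a, b, c, d in ctl for o in range(a, b))
print(len(num2))                       # 188 -- too short, and misordered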
I see. It is probably the best way.