unicode

Skaperen · Nov-15-2017, 01:52 AM

the range of ASCII characters that are "printable" is 32 to 126, or 33 to 126, inclusive, depending on whether blank spaces are considered "printable" (they are at least safe to try to print). i am looking for a list of ranges of "printable" characters in unicode.

i am making some code to dump binary in a "readable" form, showing both the binary code in hexadecimal as well as the character if it is a printable one, else a '.' in place of each byte. i have created one for ASCII in C (and there are many others around, in pretty much every language, i'm sure). i want to create one in python3 that includes support for UTF-8. that is, wherever it finds printable byte code combinations, it will output the character in the place where the character goes, depending on the dump style/format.

here is an example from the one i made (in 3 different widths) modeled after a common IBM mainframe dump style:

Output:
lt1/forums /home/forums 6> which xd16
/usr/local/bin/xd16
lt1/forums /home/forums 7> xd16 < /usr/local/bin/xd16 | head -n40 | tail -n8
00000200  52e57464 04000000 281e0000 00000000 |R.td....(.......|
00000210  281e6000 00000000 281e6000 00000000 |(.`.....(.`.....|
00000220  d8010000 00000000 d8010000 00000000 |................|
00000230  01000000 00000000 2f6c6962 36342f6c |......../lib64/l|
00000240  642d6c69 6e75782d 7838362d 36342e73 |d-linux-x86-64.s|
00000250  6f2e3200 04000000 10000000 01000000 |o.2.............|
00000260  474e5500 00000000 02000000 06000000 |GNU.............|
00000270  0f000000 04000000 14000000 03000000 |................|
lt1/forums /home/forums 8>
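
A minimal sketch of how a dump in this style might be extended to UTF-8 in python3. It is not the xd16 program shown above. The names `utf8_length`, `utf8_cells` and `dump` are hypothetical, the set of "non-printable" categories is an assumption, and so is the layout choice of showing a multi-byte character at its lead byte and nothing for its continuation bytes.

```python
import unicodedata

# Assumption: general categories treated as not printable (controls, format,
# surrogates, private use, unassigned, line and paragraph separators).
NON_PRINTABLE = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}


def utf8_length(lead):
    """Sequence length implied by a UTF-8 lead byte, or 0 if it is not one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def utf8_cells(data):
    """Return one display string per byte of data.

    The lead byte of a valid, printable UTF-8 sequence gets the decoded
    character, its continuation bytes get '' (a layout assumption), and
    every other byte gets '.'.
    """
    cells = []
    i = 0
    while i < len(data):
        n = utf8_length(data[i])
        ch = None
        if n and i + n <= len(data):
            try:
                # Strict decoding rejects overlong forms and surrogates.
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if ch is not None and unicodedata.category(ch) not in NON_PRINTABLE:
            cells.append(ch)
            cells.extend([""] * (n - 1))
            i += n
        else:
            cells.append(".")
            i += 1
    return cells


def dump(data, width=16):
    """Yield dump lines: offset, hex in 4-byte groups, then the text column."""
    cells = utf8_cells(data)
    hex_width = width * 2 + (width + 3) // 4 - 1
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        groups = [chunk[j:j + 4].hex() for j in range(0, len(chunk), 4)]
        text = "".join(cells[off:off + width])
        yield "{:08x}  {} |{}|".format(off, " ".join(groups).ljust(hex_width), text)


if __name__ == "__main__":
    sample = b"GNU\x00" + "caf\u00e9 \u6f22".encode("utf-8") + b"\xff\x01"
    for line in dump(sample):
        print(line)
```

Because the cells are computed over the whole input before it is split into lines, a character whose bytes straddle a line boundary is still recognized and is shown on the line that holds its lead byte.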

***stranac*** · Nov-15-2017, 08:49 AM

Don't know if there are exact ranges, but take a look at https://docs.python.org/3/library/unicod...a.category and https://en.wikipedia.org/wiki/Template:G..._(Unicode)
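
As a rough sketch of how `unicodedata.category` could be turned into a list of ranges: the code below treats a code point as "printable" when its general category is not in a small exclusion set. That exclusion set is an assumption, not an official definition, and the helper names `is_printable` and `printable_ranges` are made up for illustration. The resulting ranges depend on the Unicode database version shipped with the Python in use.

```python
import sys
import unicodedata

# Assumption: "printable" means the general category is none of
# Cc (control), Cf (format), Cs (surrogate), Co (private use),
# Cn (unassigned), Zl (line separator), Zp (paragraph separator).
NON_PRINTABLE = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}


def is_printable(cp):
    """Return True if code point cp looks printable under the assumption above."""
    return unicodedata.category(chr(cp)) not in NON_PRINTABLE


def printable_ranges(limit=sys.maxunicode + 1):
    """Yield (first, last) inclusive ranges of printable code points below limit."""
    start = None
    for cp in range(limit):
        if is_printable(cp):
            if start is None:
                start = cp
        elif start is not None:
            yield (start, cp - 1)
            start = None
    if start is not None:
        yield (start, limit - 1)


if __name__ == "__main__":
    # Only the ranges below U+0100 are shown; the full list is long.
    for first, last in printable_ranges(0x100):
        print("U+{:04X}-U+{:04X}".format(first, last))
```

The built-in `str.isprintable()` is a ready-made alternative; it differs slightly from this sketch, for example it counts the ASCII space as printable but not other space separators.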

***sparkz_alot*** · (This post was last modified: Nov-15-2017, 03:05 PM by sparkz_alot.)

(Nov-15-2017, 08:49 AM)stranac Wrote: i am looking for a list of ranges of "printable" characters in unicode.

They are all 'printable' to a certain extent. Not all code points are assigned and there is no font (that I am aware of) that supports all 1114112 Unicode code points (0 through 1114111).

That said, these are the code "planes"

Output:
Unicode Planes

Plane 0
Basic Multilingual Plane (BMP)
Hex 0000-FFFF
Dec 0-65535
    Note: D800-DBFF High Surrogate; DC00-DFFF Low Surrogate for utf-16
    55296-56319 (High) 56320-57343 (Low)

Plane 1
Supplementary Multilingual Plane (SMP)
Hex 10000-1FFFF
Dec 65536-131071

Plane 2
Supplementary Ideographic Plane (SIP)
Hex 20000-2FFFF
Dec 131072-196607

Planes 3-13
(unassigned)
Hex 30000-DFFFF
Dec 196608-917503

Plane 14
Supplementary Special-Purpose Plane (SSP)
Hex E0000-EFFFF
Dec 917504-983039

Plane 15
Supplementary Private Use Area-A (SPUA-A)
Hex F0000-FFFFF
Dec 983040-1048575

Plane 16
Supplementary Private Use Area-B (SPUA-B)
Hex 100000-10FFFF
Dec 1048576-1114111
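
Since every plane holds 65536 (0x10000) code points, the plane number is just the code point shifted right by 16 bits. A small sketch, with hypothetical helper names `plane` and `is_surrogate`:

```python
MAX_CODE_POINT = 0x10FFFF  # 1114111, the last code point of plane 16


def plane(cp):
    """Return the plane number (0-16) that code point cp belongs to."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: {!r}".format(cp))
    return cp >> 16  # same as cp // 0x10000


def is_surrogate(cp):
    """True for the UTF-16 surrogate code points D800-DFFF in plane 0."""
    return 0xD800 <= cp <= 0xDFFF


if __name__ == "__main__":
    for ch in ("A", "\u6f22", "\U0001F600"):
        print("U+{:04X} is in plane {}".format(ord(ch), plane(ord(ch))))
```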

Two years ago I wrote this code to see what I could 'print'.

""" A program to print all unicode glyphs to a file """

# Code points 55296-57343 (U+D800-U+DFFF) are the surrogates for UTF-16.
# They cannot be encoded as UTF-8, so they are skipped.
with open("unicode_symbols_v2.txt", "w", encoding="utf-8") as file:
    # The upper bound of range() is exclusive, so 1114112 is needed
    # to include the last code point, 1114111 (U+10FFFF).
    for a in range(0, 1114112):
        if 55296 <= a <= 57343:
            continue
        file.write('Decimal: ')
        file.write(str(a))
        file.write('  Hex: ')
        file.write(hex(a))
        file.write('  Binary: ')
        file.write(bin(a))
        file.write('  Character: ')
        file.write(chr(a))
        file.write("\n")

I should update it, since the changes of 3.6, but maybe later. This results in a rather large text file (~85,000 kb). Maybe you can glean something useful from it.

Skaperen · (This post was last modified: Nov-16-2017, 03:00 AM by Skaperen.)

i did try writing some code to try to output a unicode table, or at least a short form of it for codes up to U+07FF. the result was a mess on the screen, among quite many codes that printed something legible. i did this with a loop for code in range(0x0800): and encoded the result from chr(code) to UTF-8 and wrote each character, one at a time, directly to the terminal, with output around it to try to make a table structure. it was not pretty. so i am trying to see what more i can do. i don't know how complete the terminal program or the fonts it uses are but have gotten double-wide CJK characters many times. i will have to figure out how best to display many things in a dump output. the python unicode database that stranac referred me to looks like it would be useful.
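
One way the unicode database might help with the double-wide problem is `unicodedata.east_asian_width`. The sketch below estimates how many terminal columns a character takes so table cells can be padded to match. The helpers `display_width` and `pad_cell` are hypothetical, and the width rules are a simplification: real terminals and fonts can disagree, especially for "ambiguous" width characters, which are counted as 1 here.

```python
import unicodedata


def display_width(ch):
    """Estimate terminal columns for ch: 0, 1 or 2 (a simplified assumption)."""
    # Combining and enclosing marks attach to the previous character.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # "W" (wide) and "F" (fullwidth) normally take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def pad_cell(ch, width=2):
    """Pad ch with spaces so the cell takes about `width` columns."""
    return ch + " " * max(0, width - display_width(ch))


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u6f22", "\u0301"):
        print("U+{:04X} width {} |{}|".format(ord(ch), display_width(ch), pad_cell(ch)))
```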

***sparkz_alot*** · Nov-16-2017, 02:43 PM

Seems you may always be one step behind. Python 3.6.3 supports Unicode Database 9, though the current is Unicode Database 10 (Unicode 10). Apparently, Python is expected to upgrade its support to UDB 10 with Python 3.7 (What's new in 3.7), but by then, the UDB will probably be 11 or higher. So if you want to program those new emojis, you're just going to have to wait.
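
To see which Unicode database a given interpreter actually ships with, `unicodedata.unidata_version` can be checked at run time. A short sketch (the helper name `unicode_db_version` is made up for illustration):

```python
import sys
import unicodedata


def unicode_db_version():
    """Return the bundled Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    print("Python", sys.version.split()[0],
          "uses Unicode database", unicodedata.unidata_version)
```

Comparing the tuple, for example `unicode_db_version() >= (10, 0, 0)`, is one way to guard code that depends on newer characters.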

Skaperen · Nov-17-2017, 02:50 AM

yeah, the Ubuntu repository tends to be slow. i wonder if they will ever get past 3.5.2. i wish PSF could build i386 and x86_64 packages of current versions in .deb and .rpm formats.

***sparkz_alot*** · Nov-17-2017, 02:05 PM

Well, if you look at what was added to v10 (aside from the emojis), it is some pretty obscure stuff, including extinct languages. I doubt there will be any great changes added to what already exists, so things like OSes and programming tools probably don't see it as something that needs immediate attention. Even considering emojis, I think I've used maybe 5 out of all the ones available to us on this site

(make that six)

Skaperen · Nov-18-2017, 08:10 AM

speaking of extinct languages. shouldn't perl be in that list, soon?
