Python Forum
unicode
#1
the range of ASCII characters that are "printable" is 32 to 126, or 33 to 126, inclusive, depending on whether blank spaces are considered "printable" (they are at least safe to try to print).  i am looking for a list of ranges of "printable" characters in unicode.

i am making some code to dump binary in a "readable" form, showing both the binary code in hexadecimal as well as the character if it is a printable one, else a '.' in place of each byte.  i have created one for ASCII in C (and there are many others around, in pretty much every language, i'm sure).  i want to create one in python3 that includes support for UTF-8.  that is, wherever it finds printable byte code combinations, it will output the character in the place where the character goes, depending on the dump style/format.

here is an example from the one i made (in 3 different widths) modeled after a common IBM mainframe dump style:

Output:
lt1/forums /home/forums 6> which xd16
/usr/local/bin/xd16
lt1/forums /home/forums 7> xd16 < /usr/local/bin/xd16 | head -n40 | tail -n8
00000200  52e57464 04000000 281e0000 00000000 |R.td....(.......|
00000210  281e6000 00000000 281e6000 00000000 |(.`.....(.`.....|
00000220  d8010000 00000000 d8010000 00000000 |................|
00000230  01000000 00000000 2f6c6962 36342f6c |......../lib64/l|
00000240  642d6c69 6e75782d 7838362d 36342e73 |d-linux-x86-64.s|
00000250  6f2e3200 04000000 10000000 01000000 |o.2.............|
00000260  474e5500 00000000 02000000 06000000 |GNU.............|
00000270  0f000000 04000000 14000000 03000000 |................|
lt1/forums /home/forums 8>
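For comparison, here is a rough Python sketch of the ASCII-only version of that dump layout. The name `dump_lines` and its parameters are made up for this illustration; this is not the xd16 source, it only mimics the 16-byte-wide format shown above, treating bytes 32 to 126 as printable.

```python
def dump_lines(data, width=16, start=0):
    """Yield dump lines: offset, hex in 4-byte groups, then ASCII text.

    Hypothetical helper for illustration; 'start' is the offset shown
    for the first byte of 'data'.
    """
    # Width of the hex column: two digits per byte plus one space
    # between each group of four bytes.
    hex_width = width * 2 + (width + 3) // 4 - 1
    for pos in range(0, len(data), width):
        chunk = data[pos:pos + width]
        groups = [chunk[i:i + 4].hex() for i in range(0, len(chunk), 4)]
        hex_part = " ".join(groups).ljust(hex_width)
        # Bytes 32..126 are shown as themselves, everything else as '.'.
        text = "".join(chr(b) if 32 <= b <= 126 else "." for b in chunk)
        yield f"{start + pos:08x}  {hex_part} |{text.ljust(width)}|"
```

Something like `for line in dump_lines(open(path, "rb").read()): print(line)` would dump a whole file, with `path` being whatever file you want to look at.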
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Don't know if there are exact ranges, but take a look at https://docs.python.org/3/library/unicod...a.category and https://en.wikipedia.org/wiki/Template:G..._(Unicode)
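As a rough idea of how `unicodedata.category` could be turned into ranges, here is a sketch. The rule used (reject every "Other" category plus the line and paragraph separators) is only one possible definition of "printable", and the helper names `is_printable` and `printable_ranges` are invented for this example. The ranges you get depend on the Unicode database version bundled with your Python.

```python
import unicodedata

def is_printable(ch):
    """Guess whether a single character is worth showing in a dump."""
    cat = unicodedata.category(ch)
    if cat[0] == "C":        # Cc, Cf, Cs, Co, Cn: control, format, surrogate,
        return False         # private use, unassigned
    if cat in ("Zl", "Zp"):  # line and paragraph separators
        return False
    return True

def printable_ranges(limit=0x110000):
    """Collapse printable code points below 'limit' into (first, last) pairs."""
    ranges = []
    start = None
    for cp in range(limit):
        if is_printable(chr(cp)):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges
```

For plain ASCII, `printable_ranges(0x80)` gives back the familiar 32 to 126. The built-in `str.isprintable()` is a stricter alternative: it also rejects every space separator except the ASCII space.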
#3
(Nov-15-2017, 08:49 AM)stranac Wrote: i am looking for a list of ranges of "printable" characters in unicode.

They are all 'printable' to a certain extent. Not all code points are assigned, and there is no font (that I am aware of) that supports all 1,114,112 Unicode code points (U+0000 through U+10FFFF).

That said, these are the code "planes"

Output:
Unicode Planes
Plane 0      Basic Multilingual Plane (BMP)                Hex 0000-FFFF       Dec 0-65535
    Note: D800-DBFF High Surrogate; DC00-DFFF Low Surrogate for utf-16
          55296-56319 (High) 56320-57343 (Low)
Plane 1      Supplementary Multilingual Plane (SMP)        Hex 10000-1FFFF     Dec 65536-131071
Plane 2      Supplementary Ideographic Plane (SIP)         Hex 20000-2FFFF     Dec 131072-196607
Planes 3-13  (unassigned)                                  Hex 30000-DFFFF     Dec 196608-917503
Plane 14     Supplementary Special-Purpose Plane (SSP)     Hex E0000-EFFFF     Dec 917504-983039
Plane 15     Supplementary Private Use Area (S PUA A)      Hex F0000-FFFFF     Dec 983040-1048575
Plane 16     Supplementary Private Use Area (S PUA B)      Hex 100000-10FFFF   Dec 1048576-1114111
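Since every plane is exactly 65536 code points wide, the plane number is just the code point shifted right by 16 bits. A small sketch of that arithmetic (the function names `plane_of` and `plane_range` are made up here):

```python
def plane_of(cp):
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    return cp >> 16

def plane_range(plane):
    """Return the (first, last) code points of a plane."""
    if not 0 <= plane <= 16:
        raise ValueError("planes run from 0 to 16")
    first = plane << 16
    return first, first | 0xFFFF
```

For example, `plane_of(0x1F600)` is 1 (the SMP), and `plane_range(16)` ends at 1114111, the last code point.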
Two years ago I wrote this code to see what I could 'print'. 

""" A program to print all unicode glyphs to a file """
import sys

# This excludes range 55296-57343 (U+D800-U+DFFF), the surrogates
# reserved for UTF-16, which cannot be encoded as UTF-8.
with open("unicode_symbols_v2.txt", "w", encoding="utf-8") as file:
    # sys.maxunicode is 1114111, so the range covers every code point.
    for a in range(sys.maxunicode + 1):
        if 55296 <= a <= 57343:
            continue
        file.write(f"Decimal: {a}  Hex: {hex(a)}  Binary: {bin(a)}  Character: {chr(a)}\n")
I should update it, since the changes of 3.6, but maybe later.  This results in a rather large text file (~85,000 kb).  Maybe you can glean something useful from it.
If it ain't broke, I just haven't gotten to it yet.
OS: Windows 10, openSuse 42.3, freeBSD 11, Raspian "Stretch"
Python 3.6.5, IDE: PyCharm 2018 Community Edition
#4
i did try writing some code to try to output a unicode table, or at least a short form of it for codes up to U+07FF.  the result was a mess on the screen, among quite many codes that printed something legible.  i did this with a loop for code in range(0x0800): and encoded the result from chr(code) to UTF-8 and wrote each character, one at a time, directly to the terminal, with output around it to try to make a table structure.  it was not pretty.  so i am trying to see what more i can do.  i don't know how complete the terminal program or the fonts it uses are but have gotten double-wide CJK characters many times.  i will have to figure out how best to display many things in a dump output.  the python unicode database that stranac referred me to looks like it would be useful.
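One thing that might help with the table alignment is estimating how many columns each character takes. This is only a sketch: `cell_width` and `table_cell` are invented names, the width rules (zero for nonspacing and enclosing marks, two for East Asian wide and fullwidth) are a simplification, and real terminals and fonts do not always agree with them.

```python
import unicodedata

def cell_width(ch):
    """Estimate how many terminal columns a character occupies."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # combining marks are drawn over the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters, e.g. most CJK
    return 1

def table_cell(cp, width=2):
    """Return the character for code point cp, padded to 'width' columns."""
    ch = chr(cp)
    if not ch.isprintable():
        ch = "."  # same placeholder a dump would use
    return ch + " " * max(0, width - cell_width(ch))
```

The range below U+0800 also contains combining marks and right-to-left scripts such as Hebrew and Arabic, which may account for some of the mess.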
#5
Seems you may always be one step behind. Python 3.6.3 supports Unicode Database 9, though the current is Unicode Database 10 (Unicode 10). Apparently, Python is expected to upgrade its support to UDB 10 with Python 3.7 (What's new in 3.7), but by then, the UDB will probably be 11 or higher.  So if you want to program those new emoji, you're just going to have to wait.
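To see which database version a given interpreter actually ships with, `unicodedata.unidata_version` can be checked at run time. A small sketch (the helper names are made up for this example):

```python
import sys
import unicodedata

def unicode_db_version():
    """Return the bundled Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

def describe_unicode_support():
    """Return a one-line summary of the Python and Unicode versions."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    return "Python %s uses Unicode %s" % (python_version, unicodedata.unidata_version)
```

Characters newer than that version report category 'Cn' (unassigned) from `unicodedata.category`, so a dump tool relying on it would show them as '.' until Python catches up.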
#6
yeah, the Ubuntu repository tends to be slow.  i wonder if they will ever get past 3.5.2.  i wish PSF could build i386 and x86_64 packages of current versions in .deb and .rpm formats.
#7
Well, if you look at what was added to v10 (aside from the emoji), it is some pretty obscure stuff, including extinct languages. I doubt there will be any great changes added to what already exists, so things like OS's and programming tools probably don't see it as something that needs immediate attention. Even considering emoji, I think I've used maybe 5 out of all the ones available to us on this site  [Image: poker.gif] (make that six)
#8
speaking of extinct languages.  shouldn't perl be in that list, soon?