Posts: 4,646
Threads: 1,493
Joined: Sep 2016
the range of ASCII characters that are "printable" is 32 to 126, or 33 to 126, inclusive, depending on whether blank spaces are considered "printable" (they are at least safe to try to print). i am looking for a list of ranges of "printable" characters in unicode.
i am making some code to dump binary in a "readable" form, showing the binary code in hexadecimal as well as the character if it is a printable one, else a '.' in place of each byte. i have created one for ASCII in C (and there are many others around, in pretty much every language, i'm sure). i want to create one in python3 that includes support for UTF-8. that is, wherever it finds printable byte code combinations, it will output the character in the place where the character goes, depending on the dump style/format.
here is an example from the one i made (in 3 different widths) modeled after a common IBM mainframe dump style:
Output: lt1/forums /home/forums 6> which xd16
/usr/local/bin/xd16
lt1/forums /home/forums 7> xd16 < /usr/local/bin/xd16 | head -n40 | tail -n8
00000200 52e57464 04000000 281e0000 00000000 |R.td....(.......|
00000210 281e6000 00000000 281e6000 00000000 |(.`.....(.`.....|
00000220 d8010000 00000000 d8010000 00000000 |................|
00000230 01000000 00000000 2f6c6962 36342f6c |......../lib64/l|
00000240 642d6c69 6e75782d 7838362d 36342e73 |d-linux-x86-64.s|
00000250 6f2e3200 04000000 10000000 01000000 |o.2.............|
00000260 474e5500 00000000 02000000 06000000 |GNU.............|
00000270 0f000000 04000000 14000000 03000000 |................|
lt1/forums /home/forums 8>
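For comparison, a rough Python 3 sketch of the same dump layout might look like the following. This is not the xd16 source (that program is in C and is not shown in this thread); `dump_lines` and its `width` parameter are names made up for this illustration, and only ASCII 32 to 126 is treated as printable.

```python
def dump_lines(data, width=16):
    """Yield dump lines: 8-digit hex offset, hex in 4-byte groups, text column.

    Hypothetical helper for illustration; the layout imitates the xd16
    output shown above, with single spaces between fields.
    """
    # Width of a full hex column: 2 hex digits per byte plus one space
    # between each pair of 4-byte groups (used to pad a short last line).
    hex_width = width * 2 + ((width + 3) // 4 - 1)
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        groups = [chunk[i:i + 4].hex() for i in range(0, len(chunk), 4)]
        hex_col = " ".join(groups).ljust(hex_width)
        # Printable ASCII (32..126) is shown as-is, every other byte as '.'.
        text = "".join(chr(b) if 32 <= b <= 126 else "." for b in chunk)
        yield f"{offset:08x} {hex_col} |{text.ljust(width)}|"


for line in dump_lines(b"/lib64/ld-linux-x86-64.so.2\x00"):
    print(line)
```

Counting a blank space as printable (32 rather than 33 as the lower bound) is the choice made here; changing the comparison is all it takes to go the other way.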
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 325
Threads: 11
Joined: Feb 2010
Posts: 1,298
Threads: 38
Joined: Sep 2016
Nov-15-2017, 03:04 PM
(This post was last modified: Nov-15-2017, 03:05 PM by sparkz_alot.)
(Nov-15-2017, 08:49 AM)stranac Wrote: i am looking for a list of ranges of "printable" characters in unicode.
They are all 'printable' to a certain extent. Not all code points are assigned and there is no font (that I am aware of) that supports all 1,114,112 Unicode code points.
That said, these are the code "planes"
Output: Unicode Planes
Plane 0
Basic Multilingual Plane (BMP)
Hex 0000-FFFF
Dec 0-65535
Note: D800-DBFF High Surrogate; DC00-DFFF Low Surrogate for utf-16
55296-56319 (High) 56320-57343 (Low)
Plane 1
Supplementary Multilingual Plane (SMP)
Hex 10000-1FFFF
Dec 65536-131071
Plane 2
Supplementary Ideographic Plane (SIP)
Hex 20000-2FFFF
Dec 131072-196607
Planes 3-13
(unassigned)
Hex 30000-DFFFF
Dec 196608-917503
Plane 14
Supplementary Special-Purpose Plane (SSP)
Hex E0000-EFFFF
Dec 917504-983039
Plane 15
Supplementary Private Use Area-A (SPUA-A)
Hex F0000-FFFFF
Dec 983040-1048575
Plane 16
Supplementary Private Use Area-B (SPUA-B)
Hex 100000-10FFFF
Dec 1048576-1114111
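Since every plane holds exactly 0x10000 (65536) code points, the plane number and its bounds can be computed instead of looked up. A small sketch, where `plane_of` and `plane_range` are names invented here for illustration:

```python
def plane_of(cp):
    """Return the plane number (0..16) that code point cp belongs to."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # Each plane spans 0x10000 code points, so the plane is the value
    # above the low 16 bits.
    return cp >> 16


def plane_range(plane):
    """Return the inclusive (first, last) code points of a plane."""
    if not 0 <= plane <= 16:
        raise ValueError(f"not a Unicode plane: {plane!r}")
    first = plane << 16
    return first, first | 0xFFFF


for plane in (0, 1, 2, 14, 15, 16):
    first, last = plane_range(plane)
    print(f"Plane {plane:2d}: Hex {first:X}-{last:X}  Dec {first}-{last}")
```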
Two years ago I wrote this code to see what I could 'print'.
""" A program to print all unicode glyphs to a file """

# Surrogate code points 55296-57343 (0xD800-0xDFFF) are reserved for UTF-16
# and cannot be encoded as UTF-8, so they are skipped.
SURROGATES = range(0xD800, 0xE000)

with open("unicode_symbols_v2.txt", "w", encoding="utf-8") as file:
    # 0x110000 (1114112) is one past the last code point, U+10FFFF.
    for a in range(0x110000):
        if a in SURROGATES:
            continue
        file.write(f"Decimal: {a} Hex: {hex(a)} Binary: {bin(a)} Character: {chr(a)}\n")

I should update it, since the changes of 3.6, but maybe later. This results in a rather large text file (~85,000 KB). Maybe you can glean something useful from it.
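As an alternative to reading through a file that size, the ranges asked about at the top of the thread can be approximated with `str.isprintable()`. This is only a sketch: "printable" here means Python's definition (everything except the Unicode "Other" and "Separator" categories, with the ASCII space counted as printable), which depends on the Unicode database bundled with the interpreter and says nothing about whether a font or terminal can draw the character. `printable_ranges` is a made-up name.

```python
import sys


def printable_ranges(limit=sys.maxunicode + 1):
    """Return inclusive (first, last) code point ranges below limit
    for which str.isprintable() is true."""
    ranges = []
    start = None
    for cp in range(limit):
        if chr(cp).isprintable():
            if start is None:
                start = cp          # a new printable run begins here
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:           # close a run that reaches the limit
        ranges.append((start, limit - 1))
    return ranges


# Show the runs below U+0250 only; call printable_ranges() for everything.
for first, last in printable_ranges(0x0250):
    print(f"U+{first:04X}..U+{last:04X}")
```

Note that private-use code points and surrogates come out as non-printable under this definition.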
If it ain't broke, I just haven't gotten to it yet.
OS: Windows 10, openSuse 42.3, freeBSD 11, Raspian "Stretch"
Python 3.6.5, IDE: PyCharm 2018 Community Edition
Posts: 4,646
Threads: 1,493
Joined: Sep 2016
Nov-16-2017, 03:00 AM
(This post was last modified: Nov-16-2017, 03:00 AM by Skaperen.)
i did try writing some code to output a unicode table, or at least a short form of it for codes up to U+07FF. the result was a mess on the screen, although quite a few codes printed something legible. i did this with a loop for code in range(0x0800): and encoded the result from chr(code) to UTF-8 and wrote each character, one at a time, directly to the terminal, with output around it to try to make a table structure. it was not pretty. so i am trying to see what more i can do. i don't know how complete the terminal program or the fonts it uses are, but i have gotten double-wide CJK characters many times. i will have to figure out how best to display many things in a dump output. the python unicode database that stranac referred me to looks like it would be useful.
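One way the `unicodedata` module could help with the text column of a UTF-8 aware dump is sketched below. It is an assumption-laden illustration, not a finished design: `utf8_length` and `text_column` are invented names, combining marks are replaced by the filler because they would attach to the previous cell, double-wide characters are assumed to take two terminal columns, and each character is padded with fillers so the column count matches the byte count. Sequences split across a line boundary simply show as fillers.

```python
import unicodedata


def utf8_length(lead):
    """Expected UTF-8 sequence length for a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that never starts a sequence


def text_column(chunk, filler="."):
    """Render bytes as text, one column per byte where possible."""
    out = []
    i = 0
    while i < len(chunk):
        n = utf8_length(chunk[i])
        ch = None
        if n and i + n <= len(chunk):
            try:
                ch = chunk[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if (ch is None or not ch.isprintable()
                or unicodedata.category(ch).startswith("M")):
            out.append(filler)  # not displayable: one filler, advance a byte
            i += 1
            continue
        # Wide and fullwidth characters are assumed to use two columns.
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        used = 2 if wide else 1
        out.append(ch + filler * max(0, n - used))
        i += n
    return "".join(out)


print(text_column("naïve 中文".encode("utf-8")))
```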
Posts: 1,298
Threads: 38
Joined: Sep 2016
Seems you may always be one step behind. Python 3.6.3 supports Unicode Database 9, though the current is Unicode Database 10 (Unicode 10). Apparently, Python is expected to upgrade its support to UDB 10 with Python 3.7 (What's new in 3.7), but by then, the UDB will probably be 11 or higher. So if you want to program those new emojis, you're just going to have to wait.
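To see which Unicode database a given interpreter actually ships with, `unicodedata.unidata_version` can be printed directly. The `is_assigned` helper below is a made-up name for this sketch; it treats general category "Cn" as "not assigned in this database version".

```python
import sys
import unicodedata

# Version of the Unicode Character Database compiled into this Python.
print("Python", ".".join(map(str, sys.version_info[:3])))
print("Unicode database", unicodedata.unidata_version)


def is_assigned(ch):
    """True if this interpreter's database knows the character.

    Unassigned code points (and noncharacters) have category "Cn".
    """
    return unicodedata.category(ch) != "Cn"


# A character added in a newer Unicode version than the bundled database
# reports False here until Python itself is upgraded.
for ch in ("A", "\U0001F600", "\uffff"):
    print(f"U+{ord(ch):04X}", is_assigned(ch), unicodedata.name(ch, "<no name>"))
```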
Posts: 4,646
Threads: 1,493
Joined: Sep 2016
yeah, the Ubuntu repository tends to be slow. i wonder if they will ever get past 3.5.2. i wish PSF could build i386 and x86_64 packages of current versions in .deb and .rpm formats.
Posts: 1,298
Threads: 38
Joined: Sep 2016
Well, if you look at what was added to v10 (aside from the emojis), it is some pretty obscure stuff, including extinct languages. I doubt there will be any great changes added to what already exists, so things like OSes and programming tools probably don't see it as something that needs immediate attention. Even considering emojis, I think I've used maybe 5 out of all the ones available to us on this site ![[Image: poker.gif]](https://python-forum.io//images/extra_smilies/poker.gif) (make that six)
Posts: 4,646
Threads: 1,493
Joined: Sep 2016
speaking of extinct languages. shouldn't perl be in that list, soon?