Python Forum

Full Version: width of Unicode character
i have been print()ing Unicode characters (not on paper) from a script that wraps each one in double quotes. some glyphs overlapped the quotes, so i added an extra space after the 1st quote and before the 2nd quote. some characters cause the quotes to show closer together, as if the character occupies no space (without my added spaces the 2 quotes would be jammed together as if nothing was between them). yet these odd characters still have a glyph that gets shown. in a few cases, the character is so wide it still overlaps the 2nd quote even with the added space (i might need to add more).

i know the displayed result is not controlled by Python. but, is there any data available in Python that can tell how the character will be printed, including right-to-left ones such as Hebrew and Arabic? knowing can help the script format the output (to make a nice dump of all printable characters).
Can you give an example?
I think it depends on the operating system or at least the Python library you are using.
(Sep-26-2021, 02:47 AM)bowlofred Wrote: Can you give an example?

http://ipal.net/python-forum/20210926131...532892.png

this output shows the Unicode code point, its decimal value in parentheses, the UTF-8 octets in hexadecimal, and, if the character is printable, an ' = ' followed by the raw Unicode character between '" ' and ' "'. note how U+0483 .. U+0489 are shifted left and reduce the total space between the double quotes. this output was rendered by xfce4-terminal version 4.12 in Xubuntu 18.04.5.
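For readers without the screenshot, here is a minimal sketch of a line in that layout. It is not the code from listutf8.py; the function name `dump_line` and the exact spacing are assumptions made for illustration, and `str.isprintable()` stands in for whatever printability test the real script uses.

```python
def dump_line(cp):
    """Format one code point: U+XXXX (decimal) utf-8 octets [= " char "].

    Hypothetical helper; the layout only approximates the description above.
    """
    ch = chr(cp)
    try:
        # Hex octets of the UTF-8 encoding, separated by spaces.
        octets = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    except UnicodeEncodeError:
        # Lone surrogates (U+D800..U+DFFF) cannot be encoded as UTF-8.
        octets = "(not encodable)"
    line = f"U+{cp:04X} ({cp}) {octets}"
    if ch.isprintable():
        # Extra space inside each quote, as described in the first post.
        line += f' = " {ch} "'
    return line

# e.g. print(dump_line(0x0483)) to see one of the "shifted" characters
```

Calling `dump_line(0x41)` gives `U+0041 (65) 41 = " A "`; for U+0483 the line starts with `U+0483 (1155) d2 83`, and how the rest is drawn is up to the terminal.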
(Sep-26-2021, 03:28 AM)SamHobbs Wrote: I think it depends on the operating system or at least the Python library you are using.

i think the OS (Xubuntu) and Python are just passing the bytes along (the UTF-8 after the Python library does the encoding). i think it is the terminal emulator rendering it that way. i suspect some kind of Unicode standard says to do it that way. what i am hoping for is some kind of data that can describe how to expect it to be rendered (by the terminal emulator).

the script, in this case, wrote the output to a file. it wrote different files based on how long their UTF-8 string would be. this image shows file "2" because these are 2 byte UTF-8 codes.
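The split by UTF-8 length can be sketched as follows. This is only an illustration of the idea, not the linked script; the function name `utf8_len` is an assumption, and the boundaries are the standard UTF-8 ranges.

```python
def utf8_len(cp):
    """Number of octets UTF-8 uses for code point cp (standard ranges)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

# Count how many code points below U+0800 fall into each group.
counts = {}
for cp in range(0x800):
    n = utf8_len(cp)
    counts[n] = counts.get(n, 0) + 1
print(counts)
```

U+0483 .. U+0489 sit in the 2-octet group, which matches the file "2" shown in the image.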

sources can be accessed at:
http://ipal.net/python-forum/listutf8.py
http://ipal.net/python-forum/to_utf8.py
http://ipal.net/python-forum/un_utf8.py
Python has unicodedata.east_asian_width(), but the information there doesn't seem to correspond to the different ways the characters are displayed.
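A small probe of what the standard library does expose may help here. The sketch below prints several `unicodedata` properties side by side; the helper name `describe` and the sample characters are chosen only for illustration. None of these values is a rendering width, so treat them as hints at best.

```python
import unicodedata

def describe(ch):
    """Collect the unicodedata properties that hint at display behaviour."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),                  # e.g. Mn, Me, Lu
        "east_asian_width": unicodedata.east_asian_width(ch),  # e.g. W, F, Na, N, A
        "combining": unicodedata.combining(ch),                # canonical combining class
        "bidirectional": unicodedata.bidirectional(ch),        # e.g. L, R, AL, NSM
    }

# Latin letter, two Cyrillic combining marks, a CJK ideograph, a Hebrew letter.
for ch in ("A", "\u0483", "\u0488", "\u4e2d", "\u05d0"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The general category looks more useful than `east_asian_width()` for the characters in the screenshot: U+0483 is reported as `Mn` (nonspacing mark) and U+0488 as `Me` (enclosing mark), while the Hebrew letter has bidirectional class `R`.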
it seems some characters are intended to go back and overstrike the previous character and have a positional width of zero. i don't know how that should work with wider characters. and i have seen at least one that looks to be triple wide while having a positional width of just one. i have seen a few double wide ones that act differently depending on whether they are followed by a space or not. i think i am going to have to dig into this terminal program's code and see how it decides what to do. in the meantime my challenge will be to output a grid of at least 2048 Unicode characters in a way that makes the code value easy to see.
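One way to approach the grid is to guess each character's column count and pad to a fixed cell. The sketch below is a heuristic only, in the spirit of the C function wcwidth(): the names `cell_width` and `padded_cell`, the choice of categories treated as zero width, and the use of U+25CC DOTTED CIRCLE as a base for combining marks are all assumptions, and a given terminal may still draw things differently.

```python
import unicodedata

def cell_width(ch):
    """Guess how many terminal columns ch advances (heuristic, not exact)."""
    # Nonspacing marks, enclosing marks and format characters normally
    # attach to the previous character and advance the cursor by zero.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters normally take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is assumed to take one column; control characters
    # should be filtered out by the caller (e.g. with str.isprintable()).
    return 1

def padded_cell(ch, columns=4):
    """Return ch padded with spaces to a fixed number of columns."""
    width = cell_width(ch)
    if width == 0:
        # Give a combining mark something to sit on instead of the quote.
        shown, used = "\u25cc" + ch, 1
    else:
        shown, used = ch, width
    return shown + " " * max(columns - used, 0)
```

With this, `padded_cell("A")` is `"A"` plus three spaces, and `padded_cell("\u0483")` is the dotted circle carrying the mark plus three spaces, so the cells should stay aligned wherever the terminal agrees with the guess. Glyphs that are drawn wider than their cell, like the "triple wide" one, are a font matter that this data cannot predict.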