tool wanted: to convert utf8 <-> unicode in hex

Skaperen · May-20-2018, 04:47 AM

i am looking for a tool to convert both ways between utf8 and unicode without using any builtin python conversion (if it is written in python) for the purpose of cross checking usage of python conversion code to be sure it is correct. this tool needs to come in easily readable and compilable/runnable source code. it needs to be organized in a simple way to make correctness verification easier. C and/or Pike and/or Python would be preferred because i can more easily read those. else i'll end up making my own in C some day. it needs to take input in args or STDIN (if args are empty) and output the converted to STDOUT.

wavic · May-20-2018, 05:57 AM

>>> hex(ord('Я'))
'0x42f'

killerrex · May-20-2018, 10:24 AM

In linux, you have iconv:

Output:
iconv --from-code=UTF-8 --to-code=UTF-16 < input.txt > output.txt

It uses the C function iconv from libiconv, so writing your own tiny wrapper in C around it should not be a big issue.

Skaperen · (This post was last modified: May-21-2018, 03:01 AM by Skaperen.)

(May-20-2018, 05:57 AM)wavic Wrote:
>>> hex(ord('Я'))
'0x42f'

so how does this convert to utf8?

(May-20-2018, 10:24 AM)killerrex Wrote: In linux, you have iconv:
Output:
iconv --from-code=UTF-8 --to-code=UTF-16 < input.txt > output.txt
It uses the C function iconv from libiconv, so writing your own tiny wrapper in C around it should not be a big issue.

it doesn't seem to do any useful conversion.

Output:
lt1/forums /home/forums 5> echo 042f | iconv --to-code=UTF-8
042f
lt1/forums /home/forums 6>

maybe i should have given an example.

Output:
> uconv 042f
d0 af
> uconv d0 94
0414
>

oh, i failed to mention, input and output should be in hexadecimal by default, but other base defaults are acceptable as long as hexadecimal can be done.

Output:
> uconv -x 0x042f
0xd0 0xaf
> uconv -x 0xd0 0x94
0x0414
>

or tracking the input base to output is good.

Output:
> uconv 0x042f
0xd0 0xaf
> uconv 0xd0 0x94
0x0414
>
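The code point to UTF-8 direction of the examples above can be sketched in Python with nothing but shifts, masks and comparisons. This is only a minimal sketch, not an existing tool: the names `encode_utf8`, `to_hex` and `main` are made up for illustration, and `int(token, 16)` is used only to parse the hexadecimal text, not to do any Unicode conversion.

```python
import sys

def encode_utf8(cp):
    """Return the UTF-8 byte values (ints) for one code point, by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

def to_hex(byte_values):
    """Format byte values as two-digit hex separated by spaces."""
    return " ".join("%02x" % b for b in byte_values)

def main():
    """Hypothetical entry point: hex code points from args, else from STDIN."""
    tokens = sys.argv[1:] or sys.stdin.read().split()
    for token in tokens:
        print(to_hex(encode_utf8(int(token, 16))))

print(to_hex(encode_utf8(0x042F)))  # d0 af
print(to_hex(encode_utf8(0x0414)))  # d0 94
```

Calling `main()` at the end of the file instead of the two demo lines would make it take hexadecimal code points from the arguments, or from STDIN when there are none. Tokens with a `0x` prefix are accepted too, since `int('0x042f', 16)` parses them.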

wavic · (This post was last modified: May-21-2018, 03:06 AM by wavic.)

Quote:so how does this convert to utf8?

>>> chr(int('0x42f', 16))
'Я'

Skaperen · May-21-2018, 03:46 AM

(May-21-2018, 03:06 AM)wavic Wrote:
Quote:so how does this convert to utf8?
>>> chr(int('0x42f', 16))
'Я'

what non-python tool does it use to do such conversion?

what program is this being run under? out of speculation, i tried running this under python3.5.2 but did not get the same results:

Output:
lt1/forums /home/forums 11> python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> chr(int('0x42f', 16))
'\u042f'
>>> 
lt1/forums /home/forums 12>

i am going to be running this to verify that i am getting the correct conversion in python. ideally this would be a command that has the actual conversion code in it, so there would be some form of bit operations or the arithmetic equivalent (*2 is like a left shift of 1).
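For the UTF-8 to code point direction, a hand-written decoder along these lines could serve as the independent check described above. It is a sketch under assumptions: `decode_utf8` and `cross_check` are illustrative names, the continuation bits are accumulated with `* 64` (the arithmetic equivalent of a left shift by 6), and Python's own `str.encode` appears only inside `cross_check`, as the thing being checked.

```python
def decode_utf8(byte_values):
    """Decode a list of ints (0..255) into a list of code points, by hand."""
    out = []
    i = 0
    n = len(byte_values)
    while i < n:
        b0 = byte_values[i]
        # The lead byte gives the sequence length and the smallest code
        # point that is allowed to use that length (rejects overlong forms).
        if b0 < 0x80:
            cp, extra, minimum = b0, 0, 0
        elif 0xC0 <= b0 < 0xE0:
            cp, extra, minimum = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 < 0xF0:
            cp, extra, minimum = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 < 0xF8:
            cp, extra, minimum = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError("invalid lead byte %#04x at offset %d" % (b0, i))
        if i + extra >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, extra + 1):
            b = byte_values[i + k]
            if (b & 0xC0) != 0x80:
                raise ValueError("bad continuation byte at offset %d" % (i + k))
            cp = cp * 64 + (b & 0x3F)  # same as (cp << 6) | (b & 0x3F)
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("overlong or out-of-range sequence at offset %d" % i)
        out.append(cp)
        i += extra + 1
    return out

def cross_check(code_points):
    """Return the code points where Python's encoder and this decoder disagree."""
    mismatches = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable as UTF-8
        builtin_bytes = list(chr(cp).encode("utf-8"))
        if decode_utf8(builtin_bytes) != [cp]:
            mismatches.append(cp)
    return mismatches

print("%04x" % decode_utf8([0xD0, 0x94])[0])  # 0414
```

Running `cross_check(range(0x110000))` would compare every code point; an empty list means the two agree everywhere.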

wavic · May-21-2018, 04:03 AM

You are missing the necessary locale settings or fonts, so it shows you the escaped Unicode value instead of the character. It's a Cyrillic letter.

Python 3.6.4 (default, Jan  5 2018, 02:35:40) 
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> chr(int('0x42f', 16))
'Я'

Skaperen · (This post was last modified: May-21-2018, 06:15 AM by Skaperen.)

i do get Cyrillic just fine and readable (i can read some of it) when the command/program outputs genuine UTF-8 bytes. it works for me when i do os.write(1,...) of byte-type data. i think there is some config setting i don't have to cause print() to convert Unicode to UTF-8 by default (maybe term type).

Output:
lt1/forums /home/forums 13> py3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> x=os.write(1,b'\xd0\xaf\x0a')
Я
>>> 
lt1/forums /home/forums 14>
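One way to test the guess about a missing setting is to look at the encoding Python picked for standard output. As far as I know, the interactive prompt falls back to backslash escapes such as '\u042f' when the text cannot be encoded in that encoding, which is what happens when the locale makes it ASCII. The helper name `shows_as_escape` below is made up for this sketch.

```python
import locale
import sys

def shows_as_escape(text, encoding):
    """True if text cannot be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return False
    except UnicodeEncodeError:
        return True

# Some captured or redirected streams report no encoding; assume ASCII then.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

print("stdout encoding:", stdout_encoding)
print("locale preferred encoding:", locale.getpreferredencoding(False))
print("escaped at this prompt:", shows_as_escape(chr(0x42F), stdout_encoding))
```

If the first line reports something like ANSI_X3.4-1968 or ascii, starting Python with a UTF-8 locale (for example LANG=en_US.UTF-8, assuming that locale is installed) or with PYTHONIOENCODING=utf-8 should change what the prompt shows.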

visit youtube and search for огонь дома. one of my favorite video viewing topics.

killerrex · May-21-2018, 10:27 AM

iconv works with the raw bytes, so if you have the value as a hex string you need to transform it first.
The result is also raw bytes, so:

Output:
$> echo "にほんご" | iconv -f UTF-8 -t UTF-16 | xxd -p
fffe6b307b30933054300a00
$> echo -n fffe6b307b30933054300a00 | xxd -r -p | iconv -f UTF-16 -t UTF-8
にほんご

You need to have a font with Japanese characters to see this correctly...

I find xxd more practical than hexdump, although you can achieve similar results.
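The xxd output above can also be read by hand. The following sketch turns the hex dump back into raw bytes (the job of `xxd -r -p`), looks at the byte order mark, and combines each pair of bytes into a 16-bit value. The function name `utf16_hex_to_code_units` is hypothetical, and surrogate pairs are not combined, so the results are UTF-16 code units, which equal the code points only for characters below U+10000.

```python
def utf16_hex_to_code_units(hex_text):
    """Split a UTF-16 hex dump (with BOM) into four-digit hex code units."""
    raw = bytes.fromhex(hex_text)  # hex text -> raw bytes, like xxd -r -p
    if len(raw) % 2 != 0:
        raise ValueError("odd number of bytes")
    if raw[:2] == b"\xff\xfe":
        little_endian = True       # BOM ff fe: low byte comes first
    elif raw[:2] == b"\xfe\xff":
        little_endian = False      # BOM fe ff: high byte comes first
    else:
        raise ValueError("no byte order mark, byte order unknown")
    units = []
    for i in range(2, len(raw), 2):  # start after the 2-byte BOM
        first, second = raw[i], raw[i + 1]
        if little_endian:
            unit = first | (second << 8)
        else:
            unit = (first << 8) | second
        units.append("%04x" % unit)
    return units

print(utf16_hex_to_code_units("fffe6b307b30933054300a00"))
# ['306b', '307b', '3093', '3054', '000a']
```

The last unit, 000a, is the newline that echo appended.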

wavic · May-21-2018, 10:31 AM

I like this.
