Python Forum
tool wanted: to convert utf8 <-> unicode in hex
#1
i am looking for a tool to convert both ways between utf8 and unicode code points without using any builtin python conversion (if it is written in python), for the purpose of cross checking my use of python conversion code to be sure it is correct. this tool needs to come as easily readable and compilable/runnable source code, organized in a simple way to make correctness verification easier. C and/or Pike and/or Python would be preferred because i can more easily read those. otherwise i'll end up making my own in C some day. it needs to take input from args or STDIN (if args are empty) and write the converted result to STDOUT.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
>>> hex(ord('Я'))
'0x42f'
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#3
On Linux, you have iconv:
Output:
iconv --from-code=UTF-8 --to-code=UTF-16 < input.txt > output.txt
It uses the C function iconv from libiconv, so writing your own tiny C wrapper around it should not be a big issue.
#4
(May-20-2018, 05:57 AM)wavic Wrote:
>>> hex(ord('Я'))
'0x42f'
so how does this convert to utf8?

(May-20-2018, 10:24 AM)killerrex Wrote: In linux, you have iconv:
Output:
iconv --from-code=UTF-8 --to-code=UTF-16 < input.txt > output.txt
It uses the C function iconv from libiconv, so writing your own tiny C wrapper around it should not be a big issue.


it doesn't seem to do any useful conversion.
Output:
lt1/forums /home/forums 5> echo 042f | iconv --to-code=UTF-8
042f
lt1/forums /home/forums 6>
maybe i should have given an example.
Output:
> uconv 042f
d0 af
> uconv d0 94
0414
>
oh, i failed to mention, input and output should be in hexadecimal by default, but other base defaults are acceptable as long as hexadecimal can be done.
Output:
> uconv -x 0x042f
0xd0 0xaf
> uconv -x 0xd0 0x94
0x0414
>
or tracking the input base to output is good.
Output:
> uconv 0x042f
0xd0 0xaf
> uconv 0xd0 0x94
0x0414
>
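A minimal sketch of the code point to UTF-8 direction from the examples above, written in Python but using only comparisons, shifts and masks rather than any builtin codec. The names `utf8_encode` and `to_hex` are made up for illustration and are not part of any existing tool; argument and STDIN handling are left out to keep the conversion logic easy to verify by eye.

```python
def utf8_encode(cp):
    """Return the UTF-8 encoding of one code point as a list of byte values."""
    # Reject values outside the Unicode range and the UTF-16 surrogate block.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


def to_hex(values):
    """Format byte values as space-separated two-digit hex."""
    return " ".join("%02x" % v for v in values)


print(to_hex(utf8_encode(0x042F)))  # d0 af
```

Parsing the hex input with `int(text, 16)` would accept both `042f` and `0x042f`, which covers the two input styles shown in the examples.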
#5
Quote: so how does this convert to utf8?
>>> chr(int('0x42f', 16))
'Я'
#6
(May-21-2018, 03:06 AM)wavic Wrote:
Quote: so how does this convert to utf8?
>>> chr(int('0x42f', 16))
'Я'

what non-python tool does it use to do such conversion?

what program is this being run under? on a guess, i tried running it under python3.5.2 but did not get the same result:
Output:
lt1/forums /home/forums 11> python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> chr(int('0x42f', 16))
'\u042f'
>>>
lt1/forums /home/forums 12>
i am going to be running this to verify if i am getting the correct conversion in python. ideally this would be a command that has the actual conversion code in it, so there would be some form of bit operations or the arithmetic equivalent (*2 is like a left shift of 1).
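The kind of bit operations described here could look like the following for the UTF-8 to code point direction. This is only a sketch under assumptions: the name `utf8_decode` is hypothetical, input is taken as a list of byte values already parsed from hex, and the validity checks (continuation bytes, overlong forms, surrogates, range) follow the usual UTF-8 rules rather than any particular tool's behavior.

```python
def utf8_decode(data):
    """Decode a list of byte values into a list of code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # The lead byte gives the sequence length and its payload bits.
        if b < 0x80:                 # 0xxxxxxx
            n, cp, lowest = 0, b, 0
        elif b < 0xC0:               # 10xxxxxx cannot start a sequence
            raise ValueError("unexpected continuation byte %#x" % b)
        elif b < 0xE0:               # 110xxxxx
            n, cp, lowest = 1, b & 0x1F, 0x80
        elif b < 0xF0:               # 1110xxxx
            n, cp, lowest = 2, b & 0x0F, 0x800
        elif b < 0xF8:               # 11110xxx
            n, cp, lowest = 3, b & 0x07, 0x10000
        else:
            raise ValueError("invalid lead byte %#x" % b)
        tail = data[i + 1:i + 1 + n]
        if len(tail) != n:
            raise ValueError("truncated sequence")
        for t in tail:
            if (t & 0xC0) != 0x80:
                raise ValueError("bad continuation byte %#x" % t)
            # Shifting left by 6 is the same as multiplying by 64.
            cp = (cp << 6) | (t & 0x3F)
        # Reject overlong encodings, surrogates and out-of-range values.
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point %#x" % cp)
        out.append(cp)
        i += 1 + n
    return out


print(" ".join("%04x" % cp for cp in utf8_decode([0xD0, 0x94])))  # 0414
```

Comparing the result against Python's own `bytes.decode("utf-8")` on the same input is then a straightforward cross-check.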
#7
You are missing the necessary locale or fonts, so it shows you the Unicode escape instead. It's a Cyrillic letter.

Python 3.6.4 (default, Jan  5 2018, 02:35:40) 
[GCC 7.2.1 20171224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> chr(int('0x42f', 16))
'Я'
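A small sketch of why the same expression can display as `'\u042f'` in one session and `'Я'` in another. It assumes the difference comes from the encoding Python picked for stdout (the thread does not show either poster's locale settings): the interactive prompt prints `repr()` of the value and falls back to backslash escapes when stdout's encoding cannot represent a character.

```python
import sys

s = chr(0x42F)

# Encoding Python chose for stdout; it depends on the locale environment.
print(sys.stdout.encoding)

# What an ASCII-only stdout would have to show for repr(s):
shown = repr(s).encode("ascii", "backslashreplace").decode("ascii")
print(shown)     # '\u042f'

# ascii() always produces the escaped form, regardless of locale.
print(ascii(s))  # '\u042f'
```

Either way the string holds the same single code point; only its display differs.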
#8
i do get Cyrillic just fine and readable (i can read some of it) when the command/program outputs genuine UTF-8 bytes. it works for me when i do os.write(1,...) of byte-type data. i think there is some config setting i don't have to cause print() to convert Unicode to UTF-8 by default (maybe term type).
Output:
lt1/forums /home/forums 13> py3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> x=os.write(1,b'\xd0\xaf\x0a')
Я
>>>
lt1/forums /home/forums 14>

visit youtube and search for огонь дома. one of my favorite video viewing topics.
#9
iconv works on raw bytes, so if you have the value as a hex string you need to convert it to bytes first.
The result is also raw bytes, so:
Output:
$> echo "にほんご" | iconv -f UTF-8 -t UTF-16 | xxd -p
fffe6b307b30933054300a00
$> echo -n fffe6b307b30933054300a00 | xxd -r -p | iconv -f UTF-16 -t UTF-8
にほんご
You need to have some font with Japanese characters to see this correctly...

I find xxd more practical than hexdump, although you can achieve similar results.
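For cross-checking the xxd output above by hand, here is a minimal sketch of UTF-16 encoding done with explicit shifts and masks. It assumes little-endian output with a leading byte order mark, which is what the `fffe` prefix in the dump indicates; the function name `utf16le_encode` is made up for illustration, and the code points are written as hex literals so that no builtin conversion is involved.

```python
def utf16le_encode(code_points, bom=True):
    """Encode code points as UTF-16 little-endian byte values."""
    out = [0xFF, 0xFE] if bom else []    # BOM U+FEFF, low byte first
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %#x" % cp)
        if cp < 0x10000:
            units = [cp]                 # one 16-bit unit
        else:
            v = cp - 0x10000             # 20 bits split into a surrogate pair
            units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
        for u in units:
            out.append(u & 0xFF)         # low byte first
            out.append(u >> 8)           # then high byte
    return out


# Code points for the four hiragana in the example plus the newline from echo.
cps = [0x306B, 0x307B, 0x3093, 0x3054, 0x0A]
print("".join("%02x" % b for b in utf16le_encode(cps)))
# fffe6b307b30933054300a00
```

The printed hex matches the `xxd -p` line from the first command.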
#10
I like this.