Python Forum
How to convert unicode to koi8-r? - Printable Version


How to convert unicode to koi8-r? - AlekseyPython - Feb-21-2019

Python 3.7.2

I write data to a database, so I want to store strings in a compact format. One Russian character occupies 2 bytes in utf8 but only 1 byte in the koi8-r encoding (I am only interested in the Russian and English alphabets; other characters can be ignored).

When I convert a mixed English-Russian string to koi8-r in Python, I get a strange sequence:
utf = 'My string, Моя строка'
koi = utf.encode(encoding='koi8-r', errors='ignore')
Output:
koi bytes: b'My string, \\xed\\xcf\\xd1 \\xd3\\xd4\\xd2\\xcf\\xcb\\xc1'
When I write these values to the database, I get write errors.
How can I convert the data to a compact format?
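
As a rough check of the size claim above, the following sketch compares the encoded length of the same string in utf-8 and koi8-r. It only measures Python `bytes` objects; how a particular database actually stores a column is a separate question that this does not answer.

```python
text = 'My string, Моя строка'

# Encode the same text with both codecs.
utf8_bytes = text.encode('utf-8')
koi8_bytes = text.encode('koi8-r')

print(len(text))        # number of characters
print(len(utf8_bytes))  # ASCII takes 1 byte, each Cyrillic letter 2 bytes
print(len(koi8_bytes))  # every character takes exactly 1 byte

# The bytes object is not corrupted text: decoding with the same
# codec returns the original string.
restored = koi8_bytes.decode('koi8-r')
print(restored == text)
```

The `b'...\xed\xcf...'` display is just how Python shows bytes outside the ASCII range; it is the koi8-r data itself, not an error.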


RE: How to convert unicode to koi8-r? - snippsat - Feb-21-2019

(Feb-21-2019, 10:51 AM)AlekseyPython Wrote: When I convert a mixed English-Russian string to koi8-r in Python, I get a strange sequence:
It's not strange: you are converting to bytes. In Python 3, strings (text) are Unicode by default.
Unicode was one of the biggest changes in the move to Python 3.
>>> utf = 'My string, Моя строка'
>>> koi = utf.encode(encoding='koi8-r', errors='ignore')
>>> koi
b'My string, \xed\xcf\xd1 \xd3\xd4\xd2\xcf\xcb\xc1'

>>> koi.decode() # Default here is the same as koi.decode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 11: invalid continuation byte

# You have to use the same encoding to get back to text (Unicode, the Python 3 default)
# Try to always use utf-8; it makes things better for everyone
>>> koi.decode('koi8-r')
'My string, Моя строка'
I don't see a reason why you should mess with this at all.
Try to avoid encoding/decoding altogether; let the DB handle it if needed.
For example, here is a test of sqlite3 with Python 3.7.
There is no need to do anything; Python 3 text (Unicode by default) goes in and out of the DB without any problems.
E:\div_code\xy
λ ptpython -i dbb.py
>>> bid = Bidbase()
>>> bid.create_db()
 
>>> bid.insert('gleen mars', '345678')
>>> utf = 'My string, Моя строка'
>>> bid.insert('foo', utf)
 
>>> bid.read_all()
gleen mars : 345678
foo : My string, Моя строка
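
The `dbb.py` module and its `Bidbase` class are not shown in the thread. A minimal sketch of the same round trip with the standard `sqlite3` module might look like this; the table name `bids` and its columns are assumptions made for illustration, not the original code.

```python
import sqlite3

# In-memory database; the schema below is hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bids (name TEXT, value TEXT)')

rows = [
    ('gleen mars', '345678'),
    ('foo', 'My string, Моя строка'),
]

# Plain str values are passed as parameters; no manual encode() call.
conn.executemany('INSERT INTO bids (name, value) VALUES (?, ?)', rows)
conn.commit()

stored = conn.execute(
    'SELECT name, value FROM bids ORDER BY rowid'
).fetchall()
for name, value in stored:
    print(f'{name} : {value}')

conn.close()
```

The strings come back as ordinary `str` objects, so the mixed English-Russian text survives the round trip unchanged.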



RE: How to convert unicode to koi8-r? - AlekseyPython - Feb-22-2019

(Feb-21-2019, 04:18 PM)snippsat Wrote: I don't see a reason why you should mess with this at all.
Try to avoid encoding/decoding altogether; let the DB handle it if needed.
For example, here is a test of sqlite3 with Python 3.7.
There is no need to do anything; Python 3 text (Unicode by default) goes in and out of the DB without any problems.

Thank you for the answer.
In your case the text will indeed be stored in the database, but it will take up twice the disk space, so reading and writing data will be twice as slow. I can set a more economical charset in MariaDB to avoid the unnecessary cost. I want to prepare the data in Python (dropping only the characters that koi8-r does not support) so that it is written to the database as completely as possible.
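
One possible way to do the preparation step described above is to encode with `errors='ignore'` and decode straight back, which leaves a `str` containing only characters that koi8-r can represent. The helper name `keep_koi8r` and the sample string are made up for this sketch.

```python
def keep_koi8r(text: str) -> str:
    """Return text with every character koi8-r cannot encode removed."""
    # errors='ignore' silently drops unsupported characters;
    # decoding with the same codec turns the bytes back into str.
    return text.encode('koi8-r', errors='ignore').decode('koi8-r')


sample = 'Price: 5€, Цена 😀'
cleaned = keep_koi8r(sample)
print(cleaned)

# After cleaning, a strict encode can no longer fail, and the
# result takes exactly one byte per character.
encoded = cleaned.encode('koi8-r')
print(len(cleaned), len(encoded))
```

The cleaned value is still a `str`, so it can be handed to a database driver as usual, and the connection or column charset then decides how it is stored.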


RE: How to convert unicode to koi8-r? - AlekseyPython - Feb-25-2019

Is there really no way to write to the database in the required encoding?


RE: How to convert unicode to koi8-r? - AlekseyPython - Mar-04-2019

It turned out that when storing strings in the utf8 encoding, MariaDB uses three bytes per character. That is, the database grows to about three times the size, and read/write speed drops threefold.


RE: How to convert unicode to koi8-r? - DeaD_EyE - Mar-04-2019

As written in the last thread, PyMySQL seems to prefer utf8mb4 as its internal encoding.
The difference: https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/

If you have a str, there is no need to encode it; the encoding happens implicitly in the library.

If you get data from outside in a different encoding, you have to decode it first, with 'koi8-r' for example.
Then you have your str, which Python stores internally as Unicode (the exact internal representation depends on the Python version).
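
The decode step described above can be sketched as follows. The `raw` value stands in for bytes received from some external koi8-r source; it reuses the byte sequence shown earlier in the thread.

```python
# Bytes as they might arrive from an external koi8-r source.
raw = b'My string, \xed\xcf\xd1 \xd3\xd4\xd2\xcf\xcb\xc1'

# Decode once, at the boundary, using the encoding the data came in.
text = raw.decode('koi8-r')
print(text)

# From here on, work with str. If bytes are needed again, encode
# explicitly with whatever encoding the consumer expects.
as_utf8 = text.encode('utf-8')
print(len(raw), len(as_utf8))
```

Decoding at the input boundary and keeping `str` everywhere else means the database library can apply its own connection encoding without any manual conversion.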