How to convert unicode to koi8-r?

AlekseyPython · (This post was last modified: Feb-21-2019, 10:51 AM by AlekseyPython.)

Python 3.7.2

I am writing data to a database, so I want to store strings in a compact format. One Russian character occupies 2 bytes in utf8 but only 1 byte in the koi8-r encoding (I am only interested in the Russian and English alphabets; other characters can be ignored).

When I convert an English-Russian string to koi8-r in Python, I get a strange sequence:

utf = 'My string, Моя строка'
koi = utf.encode(encoding='koi8-r', errors='ignore')

Output:
koi	bytes: b'My string, \\xed\\xcf\\xd1 \\xd3\\xd4\\xd2\\xcf\\xcb\\xc1'

When I write these values to the database, I get write errors.
How can I convert the data to a 'small format'?

snippsat · (This post was last modified: Feb-21-2019, 04:18 PM by snippsat.)

(Feb-21-2019, 10:51 AM)AlekseyPython Wrote: When I convert an English-Russian string to koi8-r in Python, I get a strange sequence:

It's not strange: you are converting to bytes. In Python 3, strings (text) are Unicode by default.
Moving to Unicode was one of the biggest changes in Python 3.

>>> utf = 'My string, Моя строка'
>>> koi = utf.encode(encoding='koi8-r', errors='ignore')
>>> koi
b'My string, \xed\xcf\xd1 \xd3\xd4\xd2\xcf\xcb\xc1'

>>> koi.decode() # Default here is the same as koi.decode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 11: invalid continuation byte

# You have to use the same encoding to get back to Python 3's default text (Unicode)
# Try to always use utf-8; it makes things better for everyone
>>> koi.decode('koi8-r')
'My string, Моя строка'
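
The session above can be condensed into a small script. This is only a sketch of the same round trip; the variable names are mine, and it assumes nothing beyond the standard `koi8-r` codec that ships with Python 3.

```python
text = 'My string, Моя строка'
koi = text.encode('koi8-r')

# The \xNN escapes are only how Python displays non-ASCII bytes;
# each escape stands for exactly one byte.
assert isinstance(koi, bytes)
assert len(koi) == len(text)  # one byte per character in KOI8-R

# Byte 0xed at index 11 (the position named in the traceback)
# is the KOI8-R code for the Cyrillic letter 'М'.
assert koi[11] == 0xed
assert bytes([koi[11]]).decode('koi8-r') == 'М'

# Decoding with the same codec restores the original text.
restored = koi.decode('koi8-r')

# Decoding with a different codec fails, as in the traceback above.
try:
    koi.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

So the "strange sequence" is the expected display of a `bytes` object, not corrupted data.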

I don't see a reason why you should mess with this at all.
Try to avoid encoding/decoding altogether, and let the DB handle it if needed.
For example, here is a test with sqlite3 and Python 3.7.
There is no need to do anything: Python 3 text (Unicode by default) will go in and out of the DB without any problems.

E:\div_code\xy
λ ptpython -i dbb.py
>>> bid = Bidbase()
>>> bid.create_db()
 
>>> bid.insert('gleen mars', '345678')
>>> utf = 'My string, Моя строка'
>>> bid.insert('foo', utf)
 
>>> bid.read_all()
gleen mars : 345678
foo : My string, Моя строка
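
The source of `dbb.py` is not shown in this post, so here is a minimal stand-in using plain `sqlite3` with an in-memory database. The table and column names are assumptions for illustration, not the actual `Bidbase` implementation.

```python
import sqlite3

# Hypothetical replacement for the Bidbase class (its code is not shown).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bids (name TEXT, value TEXT)')

utf = 'My string, Моя строка'
rows = [('gleen mars', '345678'), ('foo', utf)]

# Plain str values are passed as parameters; no manual encode() is needed.
conn.executemany('INSERT INTO bids (name, value) VALUES (?, ?)', rows)
conn.commit()

stored = conn.execute('SELECT name, value FROM bids ORDER BY rowid').fetchall()
for name, value in stored:
    print(f'{name} : {value}')

conn.close()
```

The values come back as `str`, identical to what was inserted.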

AlekseyPython · Feb-22-2019, 08:33 AM

(Feb-21-2019, 04:18 PM)snippsat Wrote: I don't see a reason why you should mess with this at all.
Try to avoid encoding/decoding altogether, and let the DB handle it if needed.
For example, here is a test with sqlite3 and Python 3.7.
There is no need to do anything: Python 3 text (Unicode by default) will go in and out of the DB without any problems.

Thank you for the answer.
In your case the text will indeed be stored in the database, but it will take up twice as much disk space, so reading and writing data will be twice as slow. I can set a more economical charset in MariaDB to avoid the unnecessary cost. I want to prepare the data in Python (dropping only the characters that koi8-r does not support), so that it is written to the database as completely as possible.
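
One possible way to do that preparation step in Python, as a sketch: the helper name `keep_koi8r` and the sample string are made up for illustration. It keeps the result as `str` and reports what was dropped; it does not touch the database.

```python
def keep_koi8r(text):
    """Return (cleaned, dropped): the text limited to characters that
    KOI8-R can represent, plus the list of characters that were removed."""
    kept, dropped = [], []
    for ch in text:
        try:
            ch.encode('koi8-r')
            kept.append(ch)
        except UnicodeEncodeError:
            dropped.append(ch)
    return ''.join(kept), dropped


# Sample with two characters outside KOI8-R:
# the euro sign (U+20AC) and e with acute accent (U+00E9).
sample = 'Цена: 5\u20ac, caf\u00e9'
cleaned, dropped = keep_koi8r(sample)

# Same cleaned text as the one-liner with errors='ignore',
# but here the dropped characters are also known.
one_liner = sample.encode('koi8-r', errors='ignore').decode('koi8-r')
```

If the list of dropped characters is not needed, the `errors='ignore'` one-liner is enough.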

AlekseyPython · Feb-25-2019, 05:40 AM

Is there really no way to write to the database in the needed encoding?

AlekseyPython · Mar-04-2019, 07:17 AM

It turned out that when writing a string in utf8 encoding, MariaDB uses three bytes per character. That is, the size of the database roughly triples and the read/write speed drops threefold.
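
For comparison, the encoded sizes can be checked on the Python side. This sketch measures only the length of the encoded bytes; how MariaDB actually stores a column depends on its type and charset, which is not measured here.

```python
text = 'My string, Моя строка'

utf8_size = len(text.encode('utf-8'))
koi8_size = len(text.encode('koi8-r'))

# 21 characters in total: 12 ASCII (1 byte each in both encodings)
# and 9 Cyrillic (2 bytes each in UTF-8, 1 byte each in KOI8-R).
print(len(text), utf8_size, koi8_size)

# A single Cyrillic letter, for reference.
cyrillic_utf8 = len('я'.encode('utf-8'))
cyrillic_koi8 = len('я'.encode('koi8-r'))
```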

DeaD_EyE · Mar-04-2019, 08:33 AM

As written in the last thread, PyMySQL seems to prefer utf8mb4 as its internal encoding.
The difference: https://www.eversql.com/mysql-utf8-vs-ut...d-utf8mb4/

If you have a str, there is no need to encode it; the encoding happens implicitly in the library.

If you get data from outside in a different encoding, you have to decode it first, with 'koi8-r' for example.
Then you have your str, which is encoded internally as utf8 (or another encoding, depending on the Python version).
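
A short sketch of that decode step. The `io.BytesIO` object is a hypothetical stand-in for whatever external source delivers KOI8-R bytes (a file, a socket, another system).

```python
import io

# Hypothetical external source: a byte stream known to be KOI8-R encoded.
external = io.BytesIO('Моя строка\n'.encode('koi8-r'))

raw = external.read()        # bytes, as received from outside
text = raw.decode('koi8-r')  # now an ordinary str

# io.TextIOWrapper performs the same decoding while reading.
external.seek(0)
with io.TextIOWrapper(external, encoding='koi8-r', newline='') as reader:
    same_text = reader.read()
```

From here on `text` is a normal str and can be handed to the database library without any further encoding.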
