 How to convert unicode to koi8-r?
#1
Python 3.7.2

I'm writing data to a database, so I want to store strings in a compact format. One Russian character occupies 2 bytes in utf8, but only 1 byte in the koi8-r encoding (I am only interested in the Russian and English alphabets; other characters can be ignored).

When I convert a mixed English-Russian string to koi8-r in Python, I get a strange sequence:
utf = 'My string, Моя строка'
koi = utf.encode(encoding='koi8-r', errors='ignore')
Output:
koi bytes: b'My string, \\xed\\xcf\\xd1 \\xd3\\xd4\\xd2\\xcf\\xcb\\xc1'
When I write these values to the database, I get write errors.
How can I convert the data to a 'compact format'?
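
For reference, the size difference described above can be checked directly in Python. This is only a small sketch on the sample string from this post; the variable names are illustrative.

```python
text = 'My string, Моя строка'

utf8_bytes = text.encode('utf-8')
koi_bytes = text.encode('koi8-r', errors='ignore')

print(len(text))        # number of characters in the str
print(len(utf8_bytes))  # ASCII takes 1 byte each, Cyrillic 2 bytes each
print(len(koi_bytes))   # every character koi8-r supports takes 1 byte

# Decoding with the same codec restores the original text.
restored = koi_bytes.decode('koi8-r')
print(restored == text)
```

The `\xed\xcf...` escapes in the output are just how Python displays bytes that are not printable ASCII; each escape stands for a single byte.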
#2
(Feb-21-2019, 10:51 AM)AlekseyPython Wrote: When I convert a mixed English-Russian string to koi8-r in Python, I get a strange sequence:
It's not strange: you are converting to bytes. In Python 3, strings (text) are Unicode by default.
The move to Unicode text was one of the biggest changes in Python 3.
>>> utf = 'My string, Моя строка'
>>> koi = utf.encode(encoding='koi8-r', errors='ignore')
>>> koi
b'My string, \xed\xcf\xd1 \xd3\xd4\xd2\xcf\xcb\xc1'

>>> koi.decode() # Default here is the same as koi.decode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 11: invalid continuation byte

# You have to use the same encoding to get back to Python 3's default text (Unicode)
# Try to always use utf-8; it makes things easier for everyone
>>> koi.decode('koi8-r')
'My string, Моя строка'
I don't see any reason to mess with this at all.
Try to avoid encoding/decoding altogether and let the DB handle it if needed.
For example, here is a test with sqlite3 on Python 3.7.
There is no need to do anything; Python 3 text (Unicode by default) goes in and out of the DB without any problems.
E:\div_code\xy
λ ptpython -i dbb.py
>>> bid = Bidbase()
>>> bid.create_db()
 
>>> bid.insert('gleen mars', '345678')
>>> utf = 'My string, Моя строка'
>>> bid.insert('foo', utf)
 
>>> bid.read_all()
gleen mars : 345678
foo : My string, Моя строка
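
The `dbb.py` file behind that session is not shown in the thread, so here is a rough stand-in using only the standard `sqlite3` module. The table and column names are made up for the sketch; the point is only that a `str` is passed as a query parameter and comes back unchanged.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bids (name TEXT, value TEXT)')

utf = 'My string, Моя строка'
# Pass the str as a parameter; the driver takes care of the encoding.
conn.execute('INSERT INTO bids (name, value) VALUES (?, ?)', ('foo', utf))
conn.commit()

rows = conn.execute('SELECT name, value FROM bids').fetchall()
for name, value in rows:
    print(f'{name} : {value}')
conn.close()
```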
#3
(Feb-21-2019, 04:18 PM)snippsat Wrote: I don't see any reason to mess with this at all.
Try to avoid encoding/decoding altogether and let the DB handle it if needed.
For example, here is a test with sqlite3 on Python 3.7.
There is no need to do anything; Python 3 text (Unicode by default) goes in and out of the DB without any problems.

Thank you for the answer.
In your case the text will indeed be stored in the database, but it will take up twice as much disk space, so reading and writing data will be twice as slow. I can set a more economical charset in MariaDB to avoid the unnecessary cost. I want to prepare the data in Python (dropping only the characters that koi8-r does not support) so that it is written to the database as completely as possible.
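
One possible way to do that preparation step, as a sketch: round-trip the text through koi8-r with `errors='ignore'`, which silently drops anything the codec cannot represent and leaves an ordinary `str`. The helper name `keep_koi8r` is made up for this example, and how the result is then stored depends on the database and driver settings.

```python
def keep_koi8r(text: str) -> str:
    """Drop every character that koi8-r cannot represent."""
    return text.encode('koi8-r', errors='ignore').decode('koi8-r')

# The trailing CJK character is not part of koi8-r and gets removed.
sample = 'My string, Моя строка語'
cleaned = keep_koi8r(sample)
print(cleaned)
```

The result is still a normal Python `str`, so it can be passed to the database driver like any other text.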
#4
Is there really no way to write to the database in the needed encoding?
#5
It turned out that when writing strings in utf8 encoding, MariaDB uses three bytes per character. That is, the size of the database grows about threefold, and the read/write speed drops threefold.
#6
As written in the last thread, PyMySQL seems to prefer utf8mb4 as its internal encoding.
The difference: https://www.eversql.com/mysql-utf8-vs-ut...d-utf8mb4/

If you have a str, there is no need to encode it; the encoding happens implicitly in the library.

If you get data from outside in a different encoding, you have to decode it first, with 'koi8-r' for example.
Then you have a str; how it is stored internally is an implementation detail that depends on the Python version.
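
A minimal sketch of that decode step, assuming the incoming bytes really are koi8-r (the byte values are taken from the example earlier in this thread):

```python
# Bytes received from outside, known to be koi8-r encoded.
raw = b'\xed\xcf\xd1 \xd3\xd4\xd2\xcf\xcb\xc1'

# Decode once at the boundary to get a normal str.
text = raw.decode('koi8-r')
print(text)

# From here on, work with str; encode only when bytes are explicitly needed.
utf8_bytes = text.encode('utf-8')
print(len(raw), len(utf8_bytes))
```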
My code examples are always for Python >=3.6.0