Python Forum
[?] UTF8, Unicode and Binary data reading troubles
#1
Hi guys.

I have been working on a script that reads files, then extracts and indexes some strings. Everything is going fine, except for one problem with file encoding.

I will try to summarize:

My script reads chunks of data from files, converts them to lowercase, and normalizes them to replace special characters (ç, á, é, ã, etc.) with their plain ASCII equivalents, using unicodedata.normalize:

unicodedata.normalize("NFKD",chunk.decode("utf8","ignore")).encode("ascii","ignore").lower()
 
In this case, the string "Olá, como vai você? Vamos caçar?" will result in "Ola, como vai voce? Vamos cacar?". This works fine with UTF8-encoded text files, but it fails when I try to retrieve strings from binary files (like MS .doc files). There, the same code returns "Ol, como vai voc? Vamos caar?": the accented characters are dropped instead of being replaced.

I have managed to get it working with MS .doc files by using the unicode-escape codec (but then it fails with the UTF8 files):
unicodedata.normalize("NFKD",chunk.decode("unicode-escape","ignore")).encode("ascii","ignore").lower()
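Both behaviours can be reproduced outside the script. The following is only a sketch in Python 3, and it assumes the .doc chunk holds the text as single-byte Latin-1/cp1252 data. That is a guess that fits the output above, not something confirmed for these files. The variable names are made up for the demonstration.

```python
text = "Olá, como vai você? Vamos caçar?"

utf8_bytes = text.encode("utf-8")      # how a UTF-8 text file stores the string
latin1_bytes = text.encode("latin-1")  # ASSUMED storage inside the .doc chunk

# In Latin-1 each accented letter is one byte (0xE1, 0xEA, 0xE7). None of them
# is valid UTF-8 on its own, so "ignore" silently drops them.
dropped = latin1_bytes.decode("utf8", "ignore")
print(dropped)

# unicode-escape decodes every non-escape byte as Latin-1, which is why it
# recovers the .doc text ...
recovered = latin1_bytes.decode("unicode-escape")
print(recovered)

# ... but it also splits each two-byte UTF-8 sequence into two Latin-1
# characters, which garbles the UTF-8 files.
mojibake = utf8_bytes.decode("unicode-escape")
print(mojibake)
```

If that assumption holds, neither codec is wrong by itself; each one is simply being applied to bytes written in the other encoding.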
After 20+ hours of research I still have no solution that makes my script work in both cases.


Unfortunately, I can't use external modules.
I will buy a beer or a pack of beers for a working solution :D
#2
Not all strings or files are UTF-8 encoded. There are heuristics to guess an encoding, but no hard rules.
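One such heuristic, as a minimal sketch using only the standard library: try a strict UTF-8 decode first, and only if that fails fall back to a single-byte codec. The function name `to_ascii_lower` and the `fallback="cp1252"` default are assumptions for illustration (cp1252 is a guess for Windows-era documents), and the code is written for Python 3.

```python
import unicodedata


def to_ascii_lower(chunk, fallback="cp1252"):
    """Return chunk (bytes) as lowercase ASCII text with accents stripped.

    Heuristic: strict UTF-8 first, then a single-byte fallback codec.
    The fallback codec is an assumption, not a detected fact.
    """
    try:
        # Strict decoding: raises on bytes that are not valid UTF-8,
        # instead of silently dropping them as "ignore" does.
        text = chunk.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so skip those.
        text = chunk.decode(fallback, "ignore")
    # NFKD splits "á" into "a" plus a combining accent; the ASCII
    # encode with "ignore" then discards the accent.
    ascii_bytes = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
    return ascii_bytes.decode("ascii").lower()


sample = "Olá, como vai você? Vamos caçar?"
print(to_ascii_lower(sample.encode("utf-8")))
print(to_ascii_lower(sample.encode("cp1252")))
```

This is still a guess, not a rule. A chunk that cuts a UTF-8 character in half will fail the strict decode and take the fallback path, and text stored as UTF-16 (which .doc files can also contain) is not handled here at all.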