Python Forum
[?] UTF8, Unicode and Binary data reading troubles
Hi guys.

I have been working on a script that reads files, then extracts and indexes some strings. Everything is going fine, except for one problem with file encoding.

I will try to summarize:

My script reads chunks of data from files, converts these chunks to lowercase, and normalizes them to replace special characters (ç, á, é, ã, etc.) with their plain ASCII equivalents using unicodedata.normalize:

unicodedata.normalize("NFKD",chunk.decode("utf8","ignore")).encode("ascii","ignore").lower()
 
In this case, the string "Olá, como vai você? Vamos caçar?" results in "ola, como vai voce? vamos cacar?". This works fine with UTF-8 encoded text files, but it fails when retrieving strings from binary files (such as MS .doc files). There, the same code returns "ol, como vai voc? vamos caar?": the accented characters are dropped entirely instead of being replaced.
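
For anyone who wants to reproduce the difference, here is a minimal Python 3 sketch. It assumes the text inside the .doc file is stored in a single-byte encoding such as Latin-1 (that is only an assumption, the real file may use something else), and `fold` is just a name made up for this example. In that kind of data, a byte like 0xE1 (á) is not valid UTF-8 on its own, so the "ignore" error handler silently drops it.

```python
import unicodedata

def fold(chunk, encoding="utf8"):
    """Python 3 form of the one-liner: decode, strip accents via NFKD, lowercase."""
    text = chunk.decode(encoding, "ignore")
    ascii_bytes = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
    return ascii_bytes.decode("ascii").lower()

phrase = "Olá, como vai você? Vamos caçar?"
utf8_bytes = phrase.encode("utf8")
latin1_bytes = phrase.encode("latin-1")  # assumed stand-in for bytes read from a .doc

print(fold(utf8_bytes))    # ola, como vai voce? vamos cacar?
print(fold(latin1_bytes))  # ol, como vai voc? vamos caar?
```

So the normalization step itself is fine in both cases. The characters are already gone after the decode step, before unicodedata.normalize ever sees them.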

I have managed to get it working with MS .doc files by decoding with unicode-escape instead (but then it fails with the UTF-8 files):
unicodedata.normalize("NFKD",chunk.decode("unicode-escape","ignore")).encode("ascii","ignore").lower()
After 20+ hours of research, I still have no solution that makes my script work in both cases.


Unfortunately, I can't use external modules.
I will buy a beer or a pack of beers for a working solution :D
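
One possible direction using only the standard library, as a rough sketch rather than a tested solution for real .doc files: try a strict UTF-8 decode first and fall back to a single-byte encoding when that fails. The unicode-escape codec treats ordinary bytes as Latin-1, which is probably why it appears to work on the .doc data, so the fallback here is cp1252 (an assumption, as are the names `decode_chunk` and `fold_any`). Written for Python 3.

```python
import unicodedata

def decode_chunk(chunk, fallback="cp1252"):
    """Try strict UTF-8; if the bytes are not valid UTF-8, assume a single-byte encoding."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 data is Windows-1252. "replace" is used because
        # cp1252 leaves a few byte values undefined; the replacement character
        # is dropped later by the ASCII step.
        return chunk.decode(fallback, "replace")

def fold_any(chunk):
    """Decode with fallback, strip accents via NFKD, and lowercase."""
    text = decode_chunk(chunk)
    ascii_bytes = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
    return ascii_bytes.decode("ascii").lower()

phrase = "Olá, como vai você? Vamos caçar?"
print(fold_any(phrase.encode("utf-8")))   # ola, como vai voce? vamos cacar?
print(fold_any(phrase.encode("cp1252")))  # ola, como vai voce? vamos cacar?
```

Two caveats: if a chunk boundary falls in the middle of a multi-byte UTF-8 character, the strict decode fails and a valid UTF-8 chunk gets treated as cp1252, so chunks may need to be split on safe boundaries. Also, Word files can store text as UTF-16 as well, which this sketch does not handle.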


Messages In This Thread
[?] UTF8, Unicode and Binary data reading troubles - by doublezero - Mar-31-2017, 06:45 PM
