Python Forum
[?] UTF8, Unicode and Binary data reading troubles



[?] UTF8, Unicode and Binary data reading troubles - doublezero - Mar-31-2017

Hi guys.

I have been working on a script that reads files, then extracts and indexes some strings. Everything is going fine, except for one problem with file encoding.

I will try to summarize:

My script reads chunks of data from files, converts these chunks to lowercase, and normalizes them to replace special characters (ç, á, é, ã, etc.) using unicodedata.normalize:

unicodedata.normalize("NFKD",chunk.decode("utf8","ignore")).encode("ascii","ignore").lower()
 
In this case, the string "Olá, como vai você? Vamos caçar?" results in "ola, como vai voce? vamos cacar?". This works fine with UTF-8 encoded text files, but it fails when retrieving strings from binary files (like MS .doc files). There, the same code returns "ol, como vai voc? vamos caar?", dropping the accented characters entirely.

I have managed to get it working with MS .doc files by decoding with unicode-escape instead (but then it fails with the UTF-8 files):
unicodedata.normalize("NFKD",chunk.decode("unicode-escape","ignore")).encode("ascii","ignore").lower()
After 20+ hours of research, I still have no solution that makes my script work in both cases.


Unfortunately, I can't use external modules.
I will buy a beer or a pack of beers for a working solution :D


RE: [?] UTF8, Unicode and Binary data reading troubles - Ofnuts - Mar-31-2017

Not all strings or files are UTF-8 encoded. There are heuristics to guess an encoding, but no hard rules.
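
One such heuristic, as a minimal sketch using only the standard library: decode strictly as UTF-8 first and, only if that raises, fall back to a single-byte encoding. The fallback to Latin-1 is an assumption, not something confirmed about your .doc files; it is consistent with unicode-escape working for you, since that codec reads plain bytes as Latin-1. The names decode_guess and fold are made up for this example.

```python
import unicodedata


def decode_guess(chunk):
    """Decode bytes as strict UTF-8, else fall back to Latin-1.

    Heuristic only: non-UTF-8 text containing accented characters almost
    always fails strict UTF-8 decoding, which triggers the fallback.
    Latin-1 is an assumed fallback; it maps every byte, so it never raises.
    """
    try:
        # Strict mode (no "ignore"), so invalid bytes raise instead of vanishing.
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return chunk.decode("latin-1")


def fold(chunk):
    """Return the chunk as lowercase ASCII text with accents stripped."""
    text = decode_guess(chunk)
    # NFKD splits "á" into "a" plus a combining accent; the ASCII encode
    # with "ignore" then drops the combining mark and keeps the base letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()
```

With this, fold() gives the same result for the sample sentence whether the bytes are UTF-8 or Latin-1. It will still guess wrong on other encodings (UTF-16 text, for instance), so treat the fallback list as something to tune against your real files.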