Mar-31-2017, 06:45 PM
Hi guys.
I have been working on a script that reads files, extracts some strings, and indexes them. Everything is going fine except for one problem with file encoding.
I will try to summarize:
My script reads chunks of data from files, converts them to lowercase, and normalizes them to replace special characters (ç, á, é, ã, etc.) with their plain ASCII equivalents using unicodedata.normalize:
unicodedata.normalize("NFKD",chunk.decode("utf8","ignore")).encode("ascii","ignore").lower()
In this case, the string "Olá, como vai você? Vamos caçar?" becomes "Ola, como vai voce? Vamos cacar?". This works fine with UTF-8 encoded text files, but it fails when retrieving strings from binary files (like MS .doc files). With the same code as above, I get the string "Ol, como vai voc? Vamos caar?" instead.
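To make the failure concrete, here is a minimal sketch that reproduces both results. It assumes the accented letters in the .doc file are stored as single bytes in Windows-1252 (cp1252); that is a guess based on the output above, not something I have confirmed. The fold helper and the sample byte strings are made up for illustration.

```python
import unicodedata

def fold(chunk, encoding):
    # Same pipeline as the line above, with the codec as a parameter
    # (hypothetical helper, only for this illustration).
    text = chunk.decode(encoding, "ignore")
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").lower()

sample = u"Ol\u00e1, como vai voc\u00ea? Vamos ca\u00e7ar?"
utf8_chunk = sample.encode("utf8")
# Assumption: stand-in for the bytes read from a .doc file.
cp1252_chunk = sample.encode("cp1252")

# UTF-8 bytes decoded as UTF-8: accents are folded to plain letters.
print(fold(utf8_chunk, "utf8"))
# cp1252 bytes decoded as UTF-8: 0xE1, 0xEA and 0xE7 are not valid UTF-8
# here, so "ignore" silently drops them before normalize ever runs.
print(fold(cp1252_chunk, "utf8"))
# The same bytes decoded with the codec they were written in.
print(fold(cp1252_chunk, "cp1252"))
```

If that assumption holds, the letters are lost in the decode step, not in unicodedata.normalize.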
I managed to get it working with MS .doc files using the unicode-escape codec (but then it fails with the UTF-8 files):
unicodedata.normalize("NFKD",chunk.decode("unicode-escape","ignore")).encode("ascii","ignore").lower()
After 20+ hours of research, I still have no solution that makes my script work in both cases.
Unfortunately, I can't use external modules.
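One standard-library-only direction, sketched here untested against real .doc files: decode strictly as UTF-8 first and fall back to a single-byte codec only when that raises. The cp1252 fallback and the names decode_chunk and fold_any are assumptions for illustration. A real .doc may also store text as UTF-16, which this sketch does not handle.

```python
import unicodedata

def decode_chunk(chunk, fallback="cp1252"):
    # Hypothetical helper: strict UTF-8 first, so invalid bytes raise
    # instead of being dropped silently.
    try:
        return chunk.decode("utf8")
    except UnicodeDecodeError:
        # Assumed fallback codec; "replace" maps undefined bytes to U+FFFD,
        # which the ASCII step below then discards.
        return chunk.decode(fallback, "replace")

def fold_any(chunk):
    # Decode with whichever codec fits, then strip accents and lowercase.
    text = decode_chunk(chunk)
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").lower()

sample = u"Ol\u00e1, como vai voc\u00ea? Vamos ca\u00e7ar?"
print(fold_any(sample.encode("utf8")))
print(fold_any(sample.encode("cp1252")))
```

One caveat: if a chunk boundary cuts a multi-byte UTF-8 character in half, the strict decode fails and that chunk goes through the fallback by mistake, so the chunks would need to be split on safe boundaries.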
I will buy a beer or a pack of beers for a working solution :D