A replacement library to python chardet
A replacement library to python chardet - Ousret - Sep-06-2019

Hi,

There is a very old issue regarding "encoding detection" in text files, which has been partially resolved by programs like Chardet. I did not like the idea of a single prober per encoding table, which could lead to hard-coded specifications. I wanted to challenge the existing methods of discovering the originating encoding.

You could consider this issue obsolete because of current norms: you should indicate the charset encoding used, as described in the standards. But the reality is different: a huge part of the internet still has content with an unknown encoding (one could point to SubRip subtitles (SRT), for instance). This is why a popular package like Requests embeds Chardet to guess the apparent encoding of remote resources.

You should know that:
- You should not care about the originating charset encoding, because two different tables can produce two identical files.

I'm brute-forcing on three premises:
- Binaries fit the encoding table (the raw bytes decode without error)
- Chaos
- Coherence

Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then I established some ground rules about what is obvious when the result looks like a mess. I know that my interpretation of what is chaotic is very subjective; feel free to contribute to improve or rewrite it.

Coherence: for each language on earth (as best we can), we have computed a ranking of letter occurrence frequencies. I thought this information was worth something here, so I compare those records against the decoded text to check whether I can detect intelligent design.

So I present to you Charset Normalizer.
https://github.com/Ousret/charset_normalizer

Feel free to help us through testing or contributing (in any way you like). I'm currently looking for contributions to this project.
Thank you

RE: A replacement library to python chardet - Ousret - Sep-23-2019

Hi,

Since Sep-06 a lot has been done. The results are encouraging. Maybe worth a second look :)
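
To make the three premises from the first post more concrete, here is a minimal sketch using only the standard library. It is not the Charset Normalizer implementation: the candidate list, the `chaos` and `coherence` scoring rules, the frequent-letter sets, and all function names are assumptions made up for illustration, and the real project's rules are far more elaborate.

```python
import unicodedata
from collections import Counter

# Illustrative subset of encoding tables to brute-force (an assumption).
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1", "cp1251"]

# Rough sets of frequent letters per language (approximate, illustration only).
FREQUENT_LETTERS = {
    "english": set("etaoinshrdlu"),
    "french": set("esaitnrulodc"),
}


def fits(payload, encoding):
    """Premise 1: return the decoded text if the bytes fit the table, else None."""
    try:
        return payload.decode(encoding)
    except UnicodeDecodeError:
        return None


def chaos(text):
    """Premise 2 (toy rule): share of symbols and non-whitespace control characters."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("S"):
            suspicious += 1  # symbols such as the copyright sign or a lone diaeresis
        elif category.startswith("C") and ch not in "\t\r\n":
            suspicious += 1  # control, format, or unassigned code points
    return suspicious / len(text)


def coherence(text):
    """Premise 3 (toy rule): how many of the text's top letters are frequent in a known language."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    top = [letter for letter, _ in Counter(letters).most_common(8)]
    return max(
        sum(letter in expected for letter in top) / len(top)
        for expected in FREQUENT_LETTERS.values()
    )


def guess(payload):
    """Try every candidate table; rank by lowest chaos, then highest coherence."""
    ranked = []
    for encoding in CANDIDATES:
        text = fits(payload, encoding)
        if text is None:
            continue  # the bytes do not fit this table
        ranked.append((encoding, chaos(text), coherence(text)))
    ranked.sort(key=lambda item: (item[1], -item[2]))
    return ranked


if __name__ == "__main__":
    sample = "Le café est très agréable, déjà prêt à être servi.".encode("utf_8")
    for encoding, mess, fit in guess(sample):
        print(encoding, round(mess, 3), round(fit, 3))
```

In this sketch `ascii` is dropped because the bytes do not fit it, while the single-byte tables decode the sample but produce stray symbols, so `utf_8` ranks first with a chaos score of zero. For a pure ASCII payload several tables tie, which matches the remark that two different tables can produce identical files.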