Python Forum
A replacement library to python chardet
Hi,

Detecting the encoding of a text file is a very old problem, and programs like Chardet have only partially solved it. I did not like the idea of a single prober per encoding table, which can lead to hard-coded specifications.

I wanted to challenge the existing methods of discovering the originating encoding.
You could consider this issue obsolete because of current norms:
You should declare the charset encoding you use, as described in the standards.
But the reality is different: a huge part of the internet still has content with an unknown encoding. (SubRip subtitle (SRT) files are one example.)

This is why a popular package like Requests embeds Chardet to guess the apparent encoding of remote resources.
You should know that:

- You should not care about the originating charset encoding, because two different tables can produce two identical files.
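
To make that point concrete, here is a small sketch using only Python's built-in codecs. The sample strings and the list of tables are my own picks for illustration, not something taken from the project.

```python
# The same text encoded with two different tables can give identical bytes,
# so the "original" encoding of such a file cannot be recovered, and it does
# not matter as long as the decoded text is the same.
text = "café"
as_latin_1 = text.encode("latin_1")
as_cp1252 = text.encode("cp1252")
same_bytes = as_latin_1 == as_cp1252  # both tables map "é" to byte 0xE9

# Pure ASCII content is the extreme case: many tables agree on it.
payload = "Hello, world!".encode("ascii")
tables = ["ascii", "utf_8", "latin_1", "cp1252", "cp437"]
decoded = {name: payload.decode(name) for name in tables}
all_identical = len(set(decoded.values())) == 1
```

In both cases any of the listed tables is an equally valid answer, which is why the goal is readable text rather than the "true" source encoding.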

I'm brute-forcing on three premises:

- The binary content fits the encoding table
- Chaos
- Coherence
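
The first premise can be sketched roughly as follows: try every candidate table and keep only those that decode the bytes without error. The function name `fitting_encodings` and the short candidate list are hypothetical, chosen for this example; this is not the actual Charset Normalizer code.

```python
# Hypothetical candidate list; a real detector would try many more tables.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1"]

def fitting_encodings(payload, candidates=CANDIDATES):
    """Return the candidate encodings that can decode payload without error."""
    fits = []
    for name in candidates:
        try:
            payload.decode(name)
        except UnicodeDecodeError:
            # The bytes do not fit this table, so discard it.
            continue
        fits.append(name)
    return fits

utf8_fits = fitting_encodings("café".encode("utf_8"))
latin_fits = fitting_encodings(b"caf\xe9")
```

Note that `latin_1` accepts any byte sequence, so fitting the table is only a first filter. The chaos and coherence checks are what separate the plausible survivors from the rest.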

Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed the results, then established some ground rules about what is obvious when the text looks like a mess. I know that my interpretation of what is chaotic is very subjective; feel free to contribute to improve or rewrite it.
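
As a rough illustration of the idea, one very simple rule could be "symbols and control characters are suspicious in human-written text". The `chaos_ratio` function below is a hypothetical stand-in built on that single assumption; the real ground rules in the project are more elaborate and may differ entirely.

```python
import unicodedata

def chaos_ratio(text):
    """Fraction of characters that look suspicious in human-written text.

    Assumption for this sketch: symbols (Unicode categories S*) and
    non-whitespace control/unassigned characters (categories C*) count
    as "mess". This is a toy rule, not the project's actual rule set.
    """
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch.isspace():
            continue
        category = unicodedata.category(ch)
        if category.startswith("S") or category.startswith("C"):
            suspicious += 1
    return suspicious / len(text)

original = "Les élèves ont été très étonnés."
# Decode UTF-8 bytes with the wrong table to produce typical mojibake.
garbled = original.encode("utf_8").decode("cp1252")

clean_score = chaos_ratio(original)
messy_score = chaos_ratio(garbled)
```

With this rule the correctly decoded sentence scores zero, while the wrongly decoded one scores higher because each accented letter turns into a letter followed by a symbol.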

Coherence: For each language on earth (as best we can), letter occurrence frequencies have been computed and ranked. I thought that this information was worth something here, so I compare those records against the decoded text to check whether I can detect intelligent design.
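
A minimal sketch of that comparison, assuming a single language and a hand-picked reference: `ENGLISH_TOP` is the commonly cited approximate ranking of frequent English letters, and `coherence_ratio` is a hypothetical name. Neither is taken from the project's data or code.

```python
from collections import Counter

# Commonly cited approximate order of the most frequent English letters.
# Illustrative only; not the frequency records used by the project.
ENGLISH_TOP = "etaoinshrdlu"

def coherence_ratio(text, reference=ENGLISH_TOP, top_n=8):
    """Share of the text's most frequent letters found in the reference."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    most_common = [ch for ch, _ in Counter(letters).most_common(top_n)]
    hits = sum(1 for ch in most_common if ch in reference)
    return hits / len(most_common)

plausible = coherence_ratio("this is a sentence that tests the thing")
implausible = coherence_ratio("zzqqxxjjkkvvww")
```

A decoded text whose dominant letters match the reference ranking is more likely to be real language than one dominated by rare letters. A real implementation would compare against many languages and keep the best match.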

So I present to you Charset Normalizer: https://github.com/Ousret/charset_normalizer
Feel free to help us through testing or contributing (in any way you like) ;)

I'm currently looking for contributions to this project.
  • Challenge 1: Bring Python 2.7 support (even if 2.7 is dying)
  • Review the English text for mistakes (because I'm not a native English speaker)
  • Challenge 2: Improve performance
  • Deploy more tests or find cases where it does not work


Thank you


Messages In This Thread
A replacement library to python chardet - by Ousret - Sep-06-2019, 12:20 PM
