In POSIX environments the "file" command can be run on a file, and its output may indicate an ASCII text file or some other type it knows about. Is there a better way in Python to check whether a file is an ASCII text file, or should I just go ahead and run the "file" command?
https://github.com/file/file/blob/master...n/magic.py
https://pypi.org/project/python-magic/
To check with pure Python if something is only ASCII:

def is_ascii(file):
    with open(file, 'rb') as fd:
        return fd.read().isascii()
But this implementation is inefficient: it has to read the whole file into memory before the check runs. Reading char by char would be slower (more calls), but allows breaking out of the loop early. Better is to read into a fixed-size bytearray buffer.
def is_ascii(file, buffersize=64 * 1024):
    buffer = bytearray(buffersize)
    with open(file, 'rb') as fd:
        # read one buffer-sized chunk at a time; readinto returns 0 at EOF
        while n := fd.readinto(buffer):
            if not buffer[:n].isascii():
                return False
    return True
The first function takes more than 60 seconds to check a 2 GiB file and consumes 2 GiB of RAM. The second function consumes only 64 KiB of RAM for the buffer and stops early if a non-ASCII byte is detected. It still has to scan the whole file, though, to guarantee that everything is ASCII.
I think what the "file" command does is read just one sector or page and do what it can with that.
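If that kind of heuristic answer is acceptable, checking only the first block of the file is trivial in pure Python. A sketch (the function name and block size are my own choices, not from the thread); note a non-ASCII byte later in the file will be missed:

```python
def looks_ascii(path, nbytes=4096):
    """Heuristic check: inspect only the first nbytes of the file,
    roughly like the `file` command inspects just the beginning.
    Returns True if that leading block is pure ASCII."""
    with open(path, 'rb') as fd:
        return fd.read(nbytes).isascii()
```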
Python has Chardet. A test of the "file" command and Chardet:
E:\div_code
λ file file_encodint.txt
file_encodint.txt: ASCII text, with CRLF line terminators
E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: ascii with confidence 1.0
Change to utf-8.
E:\div_code
λ file file_encodint.txt
file_encodint.txt: UTF-8 Unicode text, with CRLF line terminators
E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: utf-8 with confidence 0.87625
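The chardetect command is a thin wrapper over Chardet's Python API: chardet.detect() takes bytes and returns a dict with the guessed encoding and a confidence value. A short sketch, assuming chardet is installed (the sample string here mirrors the one used later in this thread):

```python
import chardet  # pip install chardet

raw = 'hello Χαίρετε добры дзень'.encode('utf-8')
result = chardet.detect(raw)

# result is a dict like {'encoding': ..., 'confidence': ..., 'language': ...}
print(result['encoding'], result['confidence'])
```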
I have used Chardet several times on this forum. In one example here a user posted a file, cities.txt, and Chardet could be used on it to discover that it was encoded with the Windows-1252 encoding.
with open('cities.txt', encoding='cp1252') as f:
    print(f.read()[:500])
UTF-8 should be the standard that everyone uses when content has Unicode, but that's not always the case.
uni_hello = 'hello Χαίρετε добры дзень 여보세요'
with open('uni_hello.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni_hello)

with open('uni_hello.txt', encoding='utf-8') as f:
    print(f.read())

Output:
hello Χαίρετε добры дзень 여보세요
Is chardet a package with a command, or did you make the command? I'm always curious what the API is like for various library things.
chardetect is a command that Chardet makes available for use at the command line. It's done in setup.py:

entry_points={'console_scripts':
    ['chardetect = chardet.cli.chardetect:main']})

entry_points makes chardetect available at the command line when you do pip install chardet. This works on all OSes.
On Windows it will even make a .exe, chardetect.exe, and place it in Python37\Scripts\chardetect.exe.
Setuptools doc
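The same pattern works for any package. A minimal, hypothetical setup.py (the package and function names here are made up for illustration, not from Chardet):

```python
# Minimal sketch of a setup.py that exposes a console_scripts entry point.
# After `pip install .`, the command `mytool` runs main() from mytool/cli.py.
from setuptools import setup, find_packages

setup(
    name='mytool',
    version='0.1',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'mytool = mytool.cli:main',
        ]
    },
)
```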