Python Forum
is this file an ASCII text file? - Printable Version



is this file an ASCII text file? - Skaperen - Nov-06-2019

in POSIX environments the "file" command can be run on the file and its output could indicate an ASCII text file or something it knows about. is there a better way in Python to check if a file is an ASCII text file or should i just go ahead and run the "file" command?


RE: is this file an ASCII text file? - DeaD_EyE - Nov-06-2019

https://github.com/file/file/blob/master/python/magic.py

https://pypi.org/project/python-magic/

To check in pure Python whether a file contains only ASCII bytes (bytes.isascii() needs Python 3.7 or later):
def is_ascii(file):
    with open(file, 'rb') as fd:
        return fd.read().isascii()
This implementation is wasteful, though: it has to read the whole file into memory before the check is done.
Reading byte by byte would be slower (more calls), but would allow breaking out of the loop early.
A better approach is to read into a reusable bytearray buffer.


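def is_ascii(file, buffersize=64 * 1024):
    buffer = bytearray(buffersize)  # reused for every read
    with open(file, 'rb') as fd:
        while True:
            n = fd.readinto(buffer)
            if not n:  # end of file: every chunk was ASCII
                return True
            if not buffer[:n].isascii():
                return False  # stop early at the first non-ASCII chunk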
The first function takes more than 60 seconds to check a 2 GiB file and uses 2 GiB of RAM.
The second function uses only 64 KiB of RAM for its buffer and stops early if a non-ASCII byte is detected.
It still has to read the whole file to guarantee that everything is ASCII.


RE: is this file an ASCII text file? - Skaperen - Nov-07-2019

i think the "file" command just reads one sector or page and does what it can with that.
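
If looking only at the start of the file is good enough, the same idea can be sketched in Python. This is a rough sketch and not a description of what "file" actually does: the helper name looks_ascii and the 4096-byte default are assumptions picked for illustration, and a file that passes can still contain non-ASCII bytes further in.

```python
import os
import tempfile


def looks_ascii(path, blocksize=4096):
    """Return True if the first `blocksize` bytes of the file are all ASCII.

    Hypothetical helper: only the head of the file is examined.
    """
    with open(path, 'rb') as fd:
        head = fd.read(blocksize)
    # bytes.isascii() needs Python 3.7+; an empty file counts as ASCII.
    return head.isascii()


def demo():
    """Write two small sample files and check each one."""
    samples = [
        ('plain.txt', b'hello\r\n'),
        ('latin.txt', 'Z\u00fcrich\n'.encode('cp1252')),  # contains byte 0xFC
    ]
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in samples:
            path = os.path.join(tmp, name)
            with open(path, 'wb') as fd:
                fd.write(data)
            results[name] = looks_ascii(path)
    return results


if __name__ == '__main__':
    print(demo())
```

The trade-off is speed against certainty: one read is cheap, but only a full scan can guarantee that the whole file is ASCII.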


RE: is this file an ASCII text file? - snippsat - Nov-07-2019

Python has Chardet.
Here is a test of file and Chardet on the same file:
E:\div_code
λ file file_encodint.txt
file_encodint.txt: ASCII text, with CRLF line terminators

E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: ascii with confidence 1.0
After changing the file's encoding to UTF-8:
E:\div_code
λ file file_encodint.txt
file_encodint.txt: UTF-8 Unicode text, with CRLF line terminators

E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: utf-8 with confidence 0.87625
I have used Chardet several times on this forum; in one example the user posted the file.
So I could run Chardet on the file cities.txt and discover that it was encoded with Windows-1252.
with open('cities.txt', encoding='cp1252') as f:
    print(f.read()[:500])
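
Why the encoding argument matters can be seen with a small standard-library experiment. This is only a sketch and not a replacement for Chardet: the sample bytes and the helper name try_decodings are made up for illustration, and trial decoding cannot really tell single-byte encodings apart, since cp1252 or latin-1 will accept almost any input.

```python
def try_decodings(data, encodings=('ascii', 'utf-8', 'cp1252')):
    """Return the first encoding that decodes `data` without error, else None.

    Hypothetical helper: the order of `encodings` decides which one wins.
    """
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
        return name
    return None


if __name__ == '__main__':
    sample = 'Z\u00fcrich'.encode('cp1252')  # b'Z\xfcrich', not valid UTF-8
    print(try_decodings(b'plain text'))                 # ascii
    print(try_decodings('Z\u00fcrich'.encode('utf-8')))  # utf-8
    print(try_decodings(sample))                        # cp1252
```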
UTF-8 should be the standard that everyone uses if the content has Unicode, but that's not always the case.
uni_hello = 'hello Χαίρετε добры дзень 여보세요'

with open('uni_hello.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni_hello)
with open('uni_hello.txt', encoding="utf-8") as f:
    print(f.read())
Output:
hello Χαίρετε добры дзень 여보세요



RE: is this file an ASCII text file? - Skaperen - Nov-08-2019

chardet is a package with a command? or did you make the command? i'm always curious what the API is like for various library things.


RE: is this file an ASCII text file? - snippsat - Nov-08-2019

chardetect is a command that Chardet provides for use at the command line.
It is set up in setup.py:
entry_points={'console_scripts':
                    ['chardetect = chardet.cli.chardetect:main']})
entry_points makes chardetect available at the command line after pip install chardet; this works on all operating systems.
On Windows it even creates an executable, chardetect.exe, and places it in Python37\Scripts\chardetect.exe
See the Setuptools documentation on entry points for details.
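
On the API side of the question, Chardet can also be called from Python through chardet.detect(), which takes bytes and returns a dict with 'encoding' and 'confidence' keys. Below is a minimal sketch, assuming chardet may or may not be installed: the wrapper name guess_encoding is hypothetical, and the import is guarded so the example returns None instead of failing when the package is missing.

```python
import sys

try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(raw):
    """Return chardet's guess for the bytes in `raw`.

    Hypothetical wrapper; returns None when chardet is not installed.
    """
    if chardet is None:
        return None
    # detect() returns a dict with 'encoding' and 'confidence' keys.
    return chardet.detect(raw)


if __name__ == '__main__':
    # Usage: python guess.py FILE [FILE ...]
    for path in sys.argv[1:]:
        with open(path, 'rb') as fd:
            print(path, guess_encoding(fd.read(64 * 1024)))
```

Only the first 64 KiB of each file is passed in here to keep memory use small; that limit is an arbitrary choice, and a guess based on a prefix can be wrong for the rest of the file.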