Python Forum
is this file an ASCII text file?
#1
in POSIX environments the "file" command can be run on the file and its output could indicate an ASCII text file or something it knows about. is there a better way in Python to check if a file is an ASCII text file or should i just go ahead and run the "file" command?
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
https://github.com/file/file/blob/master...n/magic.py

https://pypi.org/project/python-magic/

To check with pure Python if something is only ASCII:
def is_ascii(file):
    with open(file, 'rb') as fd:
        return fd.read().isascii()
But this implementation is wasteful: it has to read the whole file into memory before the check is done.
Reading byte by byte would be slower (more calls), but lets you break out of the loop early.
Better is to read in chunks into a bytearray used as a buffer.


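def is_ascii(file, buffersize=64 * 1024):
    # One reusable buffer keeps memory use constant for any file size.
    buffer = bytearray(buffersize)
    with open(file, 'rb') as fd:
        while True:
            # readinto() fills the buffer and returns the byte count (0 at EOF).
            n = fd.readinto(buffer)
            if n == 0:
                return True
            # Only the first n bytes are fresh data from this read.
            if not buffer[:n].isascii():
                return False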
The first function takes more than 60 seconds to check a 2 GiB file and consumes 2 GiB of RAM.
The second function uses only 64 KiB of RAM for the buffer and stops early if a non-ASCII byte is detected.
To guarantee that everything is ASCII, the second function still has to check the whole file.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
i think that what the "file" command does is read just one sector or page and does what it can with that.
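A rough sketch of that "look at the first block only" idea in pure Python. This is not what "file" actually does internally; the 4096-byte block size, the name looks_like_ascii_text and the set of bytes counted as text are all assumptions made up for illustration.

```python
import string

# Bytes treated as "text": printable ASCII plus the usual whitespace controls.
# This set is an assumption for illustration, not the rule `file` applies.
TEXT_BYTES = frozenset(string.printable.encode('ascii'))


def looks_like_ascii_text(path, blocksize=4096):
    """Guess from the first block only whether path is ASCII text."""
    with open(path, 'rb') as fd:
        block = fd.read(blocksize)
    # An empty file has no bytes to judge; this sketch calls it "not text".
    if not block:
        return False
    # Every sampled byte must be printable ASCII or whitespace.
    return all(byte in TEXT_BYTES for byte in block)
```

Because only the first block is sampled, a non-ASCII byte later in the file goes unnoticed, so this is a quick guess and not a guarantee.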
#4
Python has Chardet.
Here is a test of file and Chardet on the same file.
E:\div_code
λ file file_encodint.txt
file_encodint.txt: ASCII text, with CRLF line terminators

E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: ascii with confidence 1.0
After changing the file to UTF-8:
E:\div_code
λ file file_encodint.txt
file_encodint.txt: UTF-8 Unicode text, with CRLF line terminators

E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: utf-8 with confidence 0.87625
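For just the ASCII versus UTF-8 distinction shown above, a strict decode with only the standard library may be enough. A minimal sketch, assuming the file fits in memory; the function name classify_text_encoding is made up for this example. Unlike Chardet it gives no confidence value and does not guess at other encodings.

```python
def classify_text_encoding(path):
    """Return 'ascii', 'utf-8', or None, based on strict decoding."""
    # Reads the whole file, so this is only suitable for files that fit in RAM.
    with open(path, 'rb') as fd:
        data = fd.read()
    # Pure ASCII is also valid UTF-8, so test for the narrower case first.
    if data.isascii():
        return 'ascii'
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        # Neither ASCII nor valid UTF-8; some other encoding or binary data.
        return None
    return 'utf-8'
```

A None result only says the bytes are not valid UTF-8; it cannot tell Windows-1252 from Latin-1 or from binary data.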
I have used Chardet several times on this forum, for example in a thread where the user posted the file.
There I could run Chardet on the file cities.txt and discover that it was encoded as Windows-1252.
with open('cities.txt', encoding='cp1252') as f:
    print(f.read()[:500])
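Chardet can also be called from Python through chardet.detect(), which takes bytes and returns a dict that includes 'encoding' and 'confidence'. A small sketch, assuming the third-party chardet package is installed; the helper name detect_encoding and the 64 KiB sample size are choices made for this example.

```python
def detect_encoding(path, sample_size=64 * 1024):
    """Return chardet's guess for the first sample_size bytes of path."""
    # Third-party package (pip install chardet); imported here so that
    # defining this helper does not fail where chardet is missing.
    import chardet

    with open(path, 'rb') as fd:
        sample = fd.read(sample_size)
    # detect() returns a dict with (at least) 'encoding' and 'confidence'.
    return chardet.detect(sample)
```

The result is a statistical guess, so it is worth checking the confidence value before passing the encoding to open().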
UTF-8 should be the standard that everyone uses if the content has Unicode, but that's not always the case.
uni_hello = 'hello Χαίρετε добры дзень 여보세요'
 
with open('uni_hello.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni_hello) 
with open('uni_hello.txt', encoding="utf-8") as f:
    print(f.read())
Output:
hello Χαίρετε добры дзень 여보세요
#5
chardet is a package with a command? or did you make the command? i'm always curious what the API is like for various library things.
#6
chardetect is a command that Chardet provides for use at the command line.
It's set up in setup.py:
entry_points={'console_scripts':
                    ['chardetect = chardet.cli.chardetect:main']})
entry_points makes chardetect available at the command line when you do pip install chardet; this works on all OSes.
On Windows it will even make an .exe, chardetect.exe, and place it in Python37\Scripts\chardetect.exe
Setuptools doc