Is this file an ASCII text file?
#1
In POSIX environments the "file" command can be run on a file, and its output may indicate ASCII text or some other type it knows about. Is there a better way in Python to check whether a file is an ASCII text file, or should I just go ahead and run the "file" command?
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
https://github.com/file/file/blob/master...n/magic.py

https://pypi.org/project/python-magic/

To check with pure Python if something is only ASCII:
def is_ascii(file):
    with open(file, 'rb') as fd:
        return fd.read().isascii()
But this implementation is inefficient: it has to read the whole file into memory before the check is done.
Reading byte by byte would be slower (more calls), but allows breaking out of the loop early.
A better approach is to read in chunks, using a bytearray as a reusable buffer.


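def is_ascii(file, buffersize=64 * 1024):
    # One reusable buffer keeps memory use at buffersize for any file size.
    buffer = bytearray(buffersize)
    with open(file, 'rb') as fd:
        while True:
            n = fd.readinto(buffer)
            if n == 0:  # end of file reached
                break
            # isascii() needs Python 3.7+; stop at the first non-ASCII chunk.
            if not buffer[:n].isascii():
                return False
    return True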
The first function takes more than 60 seconds to check a 2 GiB file and consumes 2 GiB of RAM.
The second function uses only 64 KiB of RAM for its buffer and returns as soon as a non-ASCII byte is detected.
To guarantee that everything is ASCII, the second function still has to read the whole file.
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
I think that what the "file" command does is read just one sector or page and work out what it can from that.
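A rough sketch of that idea in Python, checking only the first block of the file. The function name, the 4096-byte sample size, and the NUL-byte rule are my own assumptions for illustration; this is not how "file" is actually implemented.

```python
def looks_like_ascii_text(path, sample_size=4096):
    """Heuristic: inspect only the first sample_size bytes of the file."""
    with open(path, 'rb') as fd:
        sample = fd.read(sample_size)
    # Every sampled byte must be below 0x80; a NUL byte usually means binary data.
    # Note: an empty file passes this check.
    return all(b < 0x80 for b in sample) and b'\x00' not in sample
```

Because only the sample is inspected, a non-ASCII byte later in the file goes unnoticed. That is the trade-off against reading the whole file.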
#4
Python has Chardet.
Here is a test of file and Chardet:
E:\div_code
λ file file_encodint.txt
file_encodint.txt: ASCII text, with CRLF line terminators

E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: ascii with confidence 1.0
After changing the file to UTF-8:
E:\div_code
λ file file_encodint.txt
file_encodint.txt: UTF-8 Unicode text, with CRLF line terminators

E:\div_code
λ chardetect file_encodint.txt
file_encodint.txt: utf-8 with confidence 0.87625
I have used Chardet several times on this forum; in one example here, the user posted the file.
So I could run Chardet on the file cities.txt and discover that it was encoded with Windows-1252:
with open('cities.txt', encoding='cp1252') as f:
    print(f.read()[:500])
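The detection can also be done from Python instead of the command line. A minimal sketch, assuming the chardet package is installed: chardet.detect() takes bytes and returns a dict with 'encoding' and 'confidence' keys. The guess_encoding name and the stdlib fallback for when chardet is missing are my own additions, not part of Chardet.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # assumption: fall back to stdlib-only checks below


def guess_encoding(path):
    """Return a best-guess encoding name for the file, or None."""
    with open(path, 'rb') as fd:
        raw = fd.read()  # reads the whole file; fine for small files
    if chardet is not None:
        # detect() returns a dict with 'encoding' and 'confidence' keys.
        return chardet.detect(raw)['encoding']
    # Fallback: a strict decode either succeeds or raises UnicodeDecodeError.
    for name in ('ascii', 'utf-8'):
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

The result is only a guess; the name it returns can be passed as the encoding argument of open(), as in the cities.txt example.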
UTF-8 should be the standard that everyone uses for content containing Unicode, but that's not always the case.
uni_hello = 'hello Χαίρετε добры дзень 여보세요'
 
with open('uni_hello.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(uni_hello) 
with open('uni_hello.txt', encoding="utf-8") as f:
    print(f.read())
Output:
hello Χαίρετε добры дзень 여보세요
#5
Is chardet a package with a command, or did you make the command yourself? I'm always curious what the API is like for various libraries.
#6
chardetect is a command that Chardet provides for use at the command line.
It is defined in setup.py:
entry_points={'console_scripts':
                    ['chardetect = chardet.cli.chardetect:main']})
entry_points makes chardetect available at the command line when you run pip install chardet; this works on all operating systems.
On Windows it even creates an executable, chardetect.exe, and places it in Python37\Scripts\chardetect.exe
Setuptools doc
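To illustrate what a 'name = package.module:function' entry looks like from the Python side, here is a simplified sketch that resolves such a string to the callable it names. The resolve_entry_point helper is hypothetical and is not setuptools' actual code; json:dumps is just a stdlib stand-in so the example runs without chardet installed.

```python
import importlib


def resolve_entry_point(spec):
    """Turn a 'package.module:function' string into the callable it names."""
    module_name, _, attr = spec.partition(':')
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stdlib stand-in for 'chardet.cli.chardetect:main'.
dumps = resolve_entry_point('json:dumps')
print(dumps({'ok': True}))  # prints {"ok": true}
```

The generated chardetect script does roughly the same thing: it imports the module on the left of the colon and calls the function on the right.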