Python Forum

Pages: 1 2

all i want to do is count the lines in each file, but some files contain strange binary bytes or older ISO encodings, and reading them fails:

Output:
Traceback (most recent call last):
  File "last-edited.py", line 38, in <module>
    pf(t,n)
  File "last-edited.py", line 7, in pf
    c = len([x for x in f])
  File "last-edited.py", line 7, in <listcomp>
    c = len([x for x in f])
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 93: invalid start byte

from sfc import *
def pf(t,n):
    if os.path.exists(n):
        with open(n) as f:
            c = len([x for x in f])
        print(t,str(c).rjust(8),n)
fn = '.edit_log'
argv.pop(0)
cd() # be in home directory
with open(fn) as el:
    nt = {}
    for ee in el:
        t,e,n = ee.strip().split()[:3]
        if n in nt:
            nt[n].append(t)
        else:
            nt[n] = [t]
ns = sorted(nt.keys())
tn = {}
for n in ns:
    nt[n].sort()
for n in ns:
    t = nt[n][-1]
    if t[13]=='-':
        t=t[0:7]+t[8:13]+t[14:16]+t[17:]
    tn[t] = n
for t,n in sorted([x for x in sorted(tn.items())]):
    if argv:
        for a in argv:
            if n.endswith(a):
                if os.path.exists(n):
                    pf(t,n)
                break
    else: # none requested so print all
        if os.path.exists(n):
            pf(t,n)

i think the only things it uses from sfc are os and cd (which, called with no args, changes directory to the user's home directory).

you need to know the file encoding.
Though not foolproof, you can usually find it with chardet: https://pypi.org/project/chardet/

maybe it will be simpler to read most of the file in binary and count the b'\n' in the bytes string i get. for really big files i don't need the exact number, just a general size.
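A minimal sketch of that binary-mode idea, assuming the count of b'\n' bytes is an acceptable stand-in for the line count. The function name count_newlines and the chunk size are made up for illustration; nothing here comes from sfc.

```python
def count_newlines(path, chunk_size=65536):
    # Count b'\n' bytes without decoding, so no codec is involved
    # and UnicodeDecodeError cannot occur.
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # empty bytes object means end of file
                break
            total += chunk.count(b"\n")
    return total
```

Like wc -l, this counts newline characters, so a last line without a trailing newline is not counted.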

There are some strange encodings, and the easiest way to deal with them is to find the proper encoding.

if it's a strange encoding i don't care if this is accurate. this operates on my edit log, which has the time of each edit and a file path. this script lists recently edited files with their times. i'm inserting a column for the number of lines. for my Python source files and most other text and source, the encoding will be ASCII or UTF-8. the file causing the problem was a C file from 2006, when i was using ISO8859 for the copyright symbol. cat that line today (the Linux kernel is doing UTF-8 reasonably well) and the copyright is just a question mark in an inverted cell.
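Since the files described here are either ASCII/UTF-8 or old ISO 8859 text, one possible approach is to try UTF-8 first and fall back to ISO-8859-1. This is only a sketch under that assumption; the function name and the default encoding list are hypothetical, and ISO-8859-1 is used as the fallback because it maps every byte value, so it never raises.

```python
def count_lines_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    # Try each encoding in order; return (line_count, encoding_used).
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return sum(1 for _ in f), enc
        except UnicodeDecodeError:
            continue               # this guess was wrong, try the next one
    raise ValueError("none of the encodings could decode " + repr(path))
```

For a file containing the byte 0xa9 (the ISO-8859-1 copyright sign), the UTF-8 attempt fails and the second attempt succeeds. The fallback does not prove the file really is ISO-8859-1; it only guarantees the read will not fail.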

i've already done this and have it showing ">9999" (if the number of lines exceeds 9999) and narrowed it to 5 characters. if the file exceeds 1048575 bytes then it prints ">####". i do f.read(1048576). i could reduce that.
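The capped column described above might look roughly like this. The 1048576-byte read, the ">9999" and ">####" markers and the 5-character width come from the post; the function name line_count_column and the exact structure are guesses, not the poster's actual code.

```python
MAX_BYTES = 1048576  # read limit mentioned in the post

def line_count_column(path):
    # Return a 5-character column: the line count, ">9999", or ">####".
    with open(path, "rb") as f:
        data = f.read(MAX_BYTES)
    if len(data) >= MAX_BYTES:     # file exceeds 1048575 bytes
        return ">####"
    n = data.count(b"\n")
    if n > 9999:
        return ">9999"
    return str(n).rjust(5)
```

Reading in binary keeps the encoding out of the picture, and the single bounded read keeps the cost per file predictable.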

If I need a quick look at the number of lines in a file, I use wc in the terminal.

cat my_filename | wc -l

It is easy to use in Python with subprocess. Specific example on file named shakespeare.txt; check_output returns bytes so I converted it into integer:

>>> import subprocess
>>> source = subprocess.Popen(['cat', 'shakespeare.txt'], stdout=subprocess.PIPE)   
>>> lines = int(subprocess.check_output(['wc', '-l'], stdin=source.stdout))     
>>> lines
4155

An alternative with more_itertools.ilen()

λ cat paillasse/sometest.py | wc -l
60
λ python
...
>>> from more_itertools import ilen
>>> with open('paillasse/sometest.py') as ifh:
...     print(ilen(ifh))
... 
60

(May-18-2021, 11:42 PM) Skaperen wrote: if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path.

You can just ignore encoding errors; open() has a parameter for this: errors="ignore", or errors='replace' (undecodable bytes become the U+FFFD replacement character, which often displays as ?).
So here is a version showing this; I just copied my own code from this Thread and made a small change.

import os

def find_files(file_type, path):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def count_lines(files):
    for file in files:
        # Reset per file so an empty file yields 0 rather than an error or a stale count
        line_nr = -1
        with open(file, encoding='utf-8', errors="ignore") as f:
            for line_nr, _ in enumerate(f, -1):
                pass
        yield file, line_nr + 1

if __name__ == '__main__':
    path = r'E:\div_code'
    file_type = '.txt'
    files = find_files(file_type, path)
    line_count = count_lines(files)
    print(list(line_count))

Output:
[('alice_in_wonderland.txt', 3599), ('test.txt', 1807), ('W2Testfile.txt', 1396)]

Compare with wc

λ wc -l *.txt
  1396 W2Testfile.txt
  3599 alice_in_wonderland.txt
  1807 test.txt
  6802 total
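To see what the two errors= settings mentioned earlier in this post actually do, here is a small sketch using a made-up byte string that contains 0xa9, the byte from the original traceback. The sample text is an assumption for illustration, not the contents of the real file.

```python
raw = b"/* Copyright \xa9 2006 */\n"   # 0xa9 is the copyright sign in ISO-8859-1

ignored = raw.decode("utf-8", errors="ignore")    # bad byte is silently dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
latin1 = raw.decode("iso-8859-1")                 # bad byte decoded as intended

print(repr(ignored))    # '/* Copyright  2006 */\n'
print(repr(replaced))   # '/* Copyright \ufffd 2006 */\n'
print(latin1.count("\n"), ignored.count("\n"), replaced.count("\n"))  # 1 1 1
```

In this example the newline survives either way, so the line count is the same whichever handler is used; only the text of the line differs.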

Hi, I don't understand lambda in this post, can you please explain?

I tried:

λ wc -l /home/pedro/summer2021/19BE/scansforOCR/*.text

in bash and just got:

pedro@pedro-HP:~$ λ wc /home/pedro/summer2021/19BE/scansforOCR/*.text
λ: command not found
pedro@pedro-HP:~$

(May-21-2021, 11:01 AM) Pedroski55 wrote: Hi, I don't understand lambda in this post, can you please explain?

You should not type the λ; it's just the default prompt symbol I see because I use cmder. The command itself starts at wc.

wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'


Skaperen

Larz60+

Skaperen

Larz60+

Skaperen

perfringo

Gribouillis

snippsat

Pedroski55

snippsat