Python Forum
all i want to do is count the lines in each file


all i want to do is count the lines in each file - Skaperen - May-18-2021

all i want to do is count the lines in each file, but some of the files contain strange binary bytes or older ISO codes.

Output:
Traceback (most recent call last):
  File "last-edited.py", line 38, in <module>
    pf(t,n)
  File "last-edited.py", line 7, in pf
    c = len([x for x in f])
  File "last-edited.py", line 7, in <listcomp>
    c = len([x for x in f])
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 93: invalid start byte

from sfc import *
def pf(t,n):
    if os.path.exists(n):
        with open(n) as f:
            c = len([x for x in f])
        print(t,str(c).rjust(8),n)
fn = '.edit_log'
argv.pop(0)
cd() # be in home directory
with open(fn) as el:
    nt = {}
    for ee in el:
        t,e,n = ee.strip().split()[:3]
        if n in nt:
            nt[n].append(t)
        else:
            nt[n] = [t]
ns = sorted(nt.keys())
tn = {}
for n in ns:
    nt[n].sort()
for n in ns:
    t = nt[n][-1]
    if t[13]=='-':
        t=t[0:7]+t[8:13]+t[14:16]+t[17:]
    tn[t] = n
for t,n in sorted([x for x in sorted(tn.items())]):
    if argv:
        for a in argv:
            if n.endswith(a):
                if os.path.exists(n):
                    pf(t,n)
                break
    else: # none requested so print all
        if os.path.exists(n):
            pf(t,n)
i think the only things it uses from sfc are os, argv and cd (which, called with no args, changes directory to the user's home directory).


RE: all i want to do is count the lines in each file - Larz60+ - May-18-2021

you need to know the file encoding.
Though not foolproof, you can usually find it with chardet: https://pypi.org/project/chardet/


RE: all i want to do is count the lines in each file - Skaperen - May-18-2021

maybe it will be simpler to read most of the file in binary and count the b'\n' in the bytes string i get. for really big files i don't need the exact number, just a general size.
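
A minimal sketch of the binary approach described above, assuming a hypothetical helper name count_newlines and an arbitrary chunk size. Nothing is decoded, so stray ISO 8859 bytes cannot raise UnicodeDecodeError:

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count b'\\n' bytes in a file without decoding it."""
    count = 0
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file never sits fully in memory.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count
```

Like wc -l, this counts newline bytes, so a final line without a trailing newline is not counted.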


RE: all i want to do is count the lines in each file - Larz60+ - May-18-2021

There are some strange encodings, and the easiest way to deal with them is to find the proper encoding.
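
When only a couple of encodings are plausible, one rough way to "find" the encoding is to try each candidate in order. This is only a sketch: the function name and the candidate list are assumptions for illustration, and iso-8859-1 accepts every byte value, so it works as a last resort but proves nothing about the real encoding.

```python
def count_lines_guessing(path, encodings=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in turn; return (line_count, encoding)."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return sum(1 for _ in f), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"none of {encodings} could decode {path!r}")
```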


RE: all i want to do is count the lines in each file - Skaperen - May-18-2021

if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path. this script lists recently edited files with their times. i'm inserting a column for the number of lines. for my Python source files and most other text and source, the encoding will be ASCII or UTF-8. the file causing the problem was a C file from 2006, when i was using ISO8859 for the copyright symbol. cat that line today (the Linux kernel is doing UTF-8 reasonably well) and the copyright symbol is just a question mark in an inverted cell.

i've already done this and have it showing ">9999" (if the number of lines exceeds 9999) and narrowed it to 5 characters. if the file exceeds 1048575 bytes then it prints ">####". i do f.read(1048576). i could reduce that.
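
The capped column described above might look roughly like this. It is a sketch rather than the actual script: the name line_column and the limit parameter are made up here, while the 1048576-byte read, the ">9999" cap and the ">####" marker come from the post.

```python
LIMIT = 1048576  # bytes read at most, as in the post

def line_column(path, limit=LIMIT):
    """Return a 5-character line-count column for one file."""
    with open(path, "rb") as f:
        data = f.read(limit)
    if len(data) >= limit:       # file is larger than limit - 1 bytes
        return ">####"
    lines = data.count(b"\n")    # no decoding, so odd bytes are harmless
    if lines > 9999:
        return ">9999"
    return str(lines).rjust(5)
```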


RE: all i want to do is count the lines in each file - perfringo - May-19-2021

If I need to have a quick look at the number of lines in a file, I use wc in the terminal.

cat my_filename | wc -l
It is easy to use in Python with subprocess. Specific example on file named shakespeare.txt; check_output returns bytes so I converted it into integer:

>>> import subprocess
>>> source = subprocess.Popen(['cat', 'shakespeare.txt'], stdout=subprocess.PIPE)   
>>> lines = int(subprocess.check_output(['wc', '-l'], stdin=source.stdout))     
>>> lines
4155
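
The same idea can be sketched without the extra cat process by handing the file to wc on stdin. The helper name wc_lines is hypothetical, and this assumes a wc executable is on the PATH:

```python
import subprocess

def wc_lines(path):
    """Return the newline count reported by `wc -l` for one file."""
    # With the file on stdin, wc prints only the number, no filename.
    with open(path, "rb") as f:
        result = subprocess.run(["wc", "-l"], stdin=f,
                                stdout=subprocess.PIPE, check=True)
    return int(result.stdout)
```

Because wc works on raw bytes, the encoding of the file does not matter here.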



RE: all i want to do is count the lines in each file - Gribouillis - May-19-2021

An alternative with more_itertools.ilen()
λ cat paillasse/sometest.py | wc -l
60
λ python
...
>>> from more_itertools import ilen
>>> with open('paillasse/sometest.py') as ifh:
...     print(ilen(ifh))
... 
60
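
For comparison, a standard-library sketch that does the same job as ilen() by exhausting the file iterator. The function name is made up for this example; opening in binary mode is a choice made here so that undecodable bytes cannot raise UnicodeDecodeError.

```python
def count_lines_stdlib(path):
    """Count lines by exhausting the file iterator, with no extra packages."""
    # Binary mode splits on b'\n' only and never decodes the content.
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

Unlike wc -l, iterating also counts a final line that has no trailing newline.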



RE: all i want to do is count the lines in each file - snippsat - May-19-2021

(May-18-2021, 11:42 PM)Skaperen Wrote: if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path.
Can just ignore encoding errors,there is a parameter for this errors="ignore" or errors='replace'(will be ?).
So can do a version showing this can just copy my own code from this Thread and make a little change.
import os

def find_files(file_type, path):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def count_lines(files):
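    for file in files:
        line_nr = -1  # so an empty file yields 0 instead of a stale or unset value
        with open(file, encoding='utf-8', errors="ignore") as f:
            # Note: start=-1 makes the result one less than the number of
            # lines iterated; it equals wc -l only when the file's last
            # line has no trailing newline.
            for line_nr, _ in enumerate(f, -1):
                pass
        yield file, line_nr + 1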

if __name__ == '__main__':
    path = r'E:\div_code'
    file_type = '.txt'
    files = find_files(file_type, path)
    line_count = count_lines(files)
    print(list(line_count))
Output:
[('alice_in_wonderland.txt', 3599), ('test.txt', 1807), ('W2Testfile.txt', 1396)]
Compare with wc
λ wc -l *.txt
  1396 W2Testfile.txt
  3599 alice_in_wonderland.txt
  1807 test.txt
  6802 total
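
To see what the two error handlers mentioned in this post do, here is a small self-contained illustration on a made-up byte string containing the 0xa9 byte from the original traceback. The sample data and variable names are invented for the example:

```python
raw = b"Copyright \xa9 2006\nint main(void);\n"

# Strict decoding (the default for open()) fails on the ISO 8859-1 byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

ignored = raw.decode("utf-8", errors="ignore")    # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte -> U+FFFD

print(strict_ok)                                  # False
print(ignored.count("\n"), replaced.count("\n"))  # 2 2
print("\ufffd" in replaced)                       # True
```

Either handler leaves the newlines intact, so the line count is unaffected.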



RE: all i want to do is count the lines in each file - Pedroski55 - May-21-2021

Hi, I don't understand lambda in this post, can you please explain?

I tried:

Quote:λ wc -l /home/pedro/summer2021/19BE/scansforOCR/*.text

in bash and just got:

Quote:pedro@pedro-HP:~$ λ wc /home/pedro/summer2021/19BE/scansforOCR/*.text
λ: command not found
pedro@pedro-HP:~$



RE: all i want to do is count the lines in each file - snippsat - May-21-2021

(May-21-2021, 11:01 AM)Pedroski55 Wrote: Hi, I don't understand lambda in this post, can you please explain?
Don't type the λ; it's just the default prompt sign in cmder, the terminal i use.
wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'