all i want to do is count the lines in each file

Skaperen · May-18-2021, 06:21 AM

all i want to do is count the lines in each file but there are strange binary bytes or older ISO codes.

Output:
Traceback (most recent call last):
  File "last-edited.py", line 38, in <module>
    pf(t,n)
  File "last-edited.py", line 7, in pf
    c = len([x for x in f])
  File "last-edited.py", line 7, in <listcomp>
    c = len([x for x in f])
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 93: invalid start byte

from sfc import *
def pf(t,n):
    if os.path.exists(n):
        with open(n) as f:
            c = len([x for x in f])
        print(t,str(c).rjust(8),n)
fn = '.edit_log'
argv.pop(0)
cd() # be in home directory
with open(fn) as el:
    nt = {}
    for ee in el:
        t,e,n = ee.strip().split()[:3]
        if n in nt:
            nt[n].append(t)
        else:
            nt[n] = [t]
ns = sorted(nt.keys())
tn = {}
for n in ns:
    nt[n].sort()
for n in ns:
    t = nt[n][-1]
    if t[13]=='-':
        t=t[0:7]+t[8:13]+t[14:16]+t[17:]
    tn[t] = n
for t,n in sorted([x for x in sorted(tn.items())]):
    if argv:
        for a in argv:
            if n.endswith(a):
                if os.path.exists(n):
                    pf(t,n)
                break
    else: # none requested so print all
        if os.path.exists(n):
            pf(t,n)

i think the only things it uses from sfc are os and cd (which, called with no args, changes directory to the user's home directory).

**Larz60+** · May-18-2021, 08:58 AM

you need to know the file encoding.
Though not foolproof, you can usually find it with chardet: https://pypi.org/project/chardet/
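If installing chardet is not an option, a cruder stdlib-only fallback is to try a short list of candidate encodings in order. This is only a sketch: the function name `read_text_lenient` and the candidate list are assumptions for illustration, and it does not detect anything, it just takes the first encoding that decodes without error.

```python
def read_text_lenient(path, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:
        data = f.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Not reached while iso-8859-1 is a candidate, since it accepts any byte.
    raise ValueError(f"none of {encodings} could decode {path!r}")
```

Because iso-8859-1 maps every byte to a character, it works as a last resort but can silently mislabel files that are really in some other 8-bit encoding.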

Skaperen · May-18-2021, 06:56 PM

maybe it will be simpler to read most of the file in binary and count the b'\n' in the bytes string i get. really big files i don't need the exact number, just a general size.
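A minimal sketch of that idea, assuming a hypothetical helper name `count_newlines` and a 1 MiB chunk size: the file is opened in binary mode, so no decoding happens and the UnicodeDecodeError cannot occur.

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file without decoding it."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += chunk.count(b"\n")
    return total
```

Like wc -l, this counts newline bytes, so a last line without a trailing newline is not counted.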

**Larz60+** · May-18-2021, 10:53 PM

There are some strange encodings, and the easiest way to deal with them is to find the proper encoding.

Skaperen · (This post was last modified: May-18-2021, 11:43 PM by Skaperen.)

if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path. this script lists recently edited files with time. i'm inserting a column for number of lines. for my Python source files and most other text and source, the encoding will be ASCII or UTF-8. the file causing the problem was a C file from 2006 when i was using ISO 8859 for the copyright symbol. cat that line today (the Linux kernel is doing UTF-8 reasonably well) and the copyright is just a question mark in an inverted cell.

i've already done this and have it showing ">9999" (if the number of lines exceeds 9999) and narrowed it to 5 characters. if the file exceeds 1048575 bytes then it prints ">####". i do f.read(1048576). i could reduce that.
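The column logic described above might look roughly like this. It is a sketch reconstructed from the description, not the actual script: the function name `line_count_column` and the constant names are made up here, and the thresholds are the ones quoted in the post.

```python
MAX_BYTES = 1048575   # files larger than this print ">####"
MAX_LINES = 9999      # counts larger than this print ">9999"

def line_count_column(path):
    """Return a 5-character line-count column for one file."""
    with open(path, "rb") as f:
        data = f.read(MAX_BYTES + 1)  # read one byte past the limit
    if len(data) > MAX_BYTES:
        return ">####"
    lines = data.count(b"\n")
    if lines > MAX_LINES:
        return ">9999"
    return str(lines).rjust(5)
```

Reading one byte more than the limit is what lets a single read() tell "exactly at the limit" apart from "too big".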

**perfringo** · May-19-2021, 07:37 AM

If I need a quick look at the number of lines in a file, I use wc in the terminal.

cat my_filename | wc -l

It is easy to use from Python with subprocess. Here is a specific example on a file named shakespeare.txt; check_output returns bytes, so I converted the result to an integer:

>>> import subprocess
>>> source = subprocess.Popen(['cat', 'shakespeare.txt'], stdout=subprocess.PIPE)   
>>> lines = int(subprocess.check_output(['wc', '-l'], stdin=source.stdout))     
>>> lines
4155
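The same result can probably be had without the cat pipe by passing the file name straight to wc. A sketch, assuming wc is on the PATH; the helper name `wc_lines` is made up for this example:

```python
import subprocess

def wc_lines(path):
    """Return the newline count that `wc -l` reports for one file."""
    # With a file argument wc prints "<count> <name>", so take the first field.
    # "--" stops option parsing in case the name starts with a dash.
    out = subprocess.check_output(["wc", "-l", "--", path])
    return int(out.split()[0])
```

Since wc works on bytes, the file's encoding does not matter here.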

**Gribouillis** · (This post was last modified: May-19-2021, 08:59 AM by Gribouillis.)

An alternative with more_itertools.ilen()

λ cat paillasse/sometest.py | wc -l
60
λ python
...
>>> from more_itertools import ilen
>>> with open('paillasse/sometest.py') as ifh:
...     print(ilen(ifh))
... 
60
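For files that are not valid UTF-8, the same iterate-and-count idea should also work in binary mode using only the standard library. This is a sketch with a made-up helper name, `count_lines_binary`; it swaps ilen() for a plain generator sum.

```python
def count_lines_binary(path):
    """Count lines by iterating the file in binary mode (no decoding)."""
    with open(path, "rb") as fh:
        # Binary files iterate line by line, splitting on b"\n".
        return sum(1 for _ in fh)
```

Unlike wc -l, this counts a final line that has no trailing newline.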

***snippsat*** · (This post was last modified: May-19-2021, 12:02 PM by snippsat.)

(May-18-2021, 11:42 PM)Skaperen Wrote: if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path.

Can just ignore encoding errors,there is a parameter for this errors="ignore" or errors='replace'(will be ?).
So can do a version showing this can just copy my own code from this Thread and make a little change.

import os

def find_files(file_type, path):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

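def count_lines(files):
    for file in files:
        # Start at -1 so an empty file yields 0 instead of leaving line_nr unset.
        line_nr = -1
        with open(file, encoding='utf-8', errors="ignore") as f:
            # Counting from -1 reports one less than the number of lines read,
            # which equals wc -l when the last line has no trailing newline.
            for line_nr, _ in enumerate(f, -1):
                pass
        yield file, line_nr + 1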

if __name__ == '__main__':
    path = r'E:\div_code'
    file_type = '.txt'
    files = find_files(file_type, path)
    line_count = count_lines(files)
    print(list(line_count))

Output:
[('alice_in_wonderland.txt', 3599), ('test.txt', 1807), ('W2Testfile.txt', 1396)]

Compare with wc

λ wc -l *.txt
  1396 W2Testfile.txt
  3599 alice_in_wonderland.txt
  1807 test.txt
  6802 total
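To see what the two error handlers do to an undecodable byte, here is a small sketch using the 0xa9 byte from the original traceback. The sample bytes and the helper name `decode_with` are made up for illustration.

```python
RAW = b"Copyright \xa9 2006\n"  # 0xa9 is the copyright sign in ISO-8859-1

def decode_with(handler):
    """Decode RAW as UTF-8 using the given error handler."""
    return RAW.decode("utf-8", errors=handler)

for handler in ("ignore", "replace"):
    print(handler, repr(decode_with(handler)))
# ignore  drops the bad byte:      'Copyright  2006\n'
# replace substitutes U+FFFD for it: 'Copyright \ufffd 2006\n'
```

Either way the newline survives, so the line count is unaffected.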

Pedroski55 · (This post was last modified: May-21-2021, 11:02 AM by Pedroski55.)

Hi, I don't understand lambda in this post, can you please explain?

I tried:

Quote:λ wc -l /home/pedro/summer2021/19BE/scansforOCR/*.text

in bash and just got:

Quote:pedro@pedro-HP:~$ λ wc /home/pedro/summer2021/19BE/scansforOCR/*.text
λ: command not found
pedro@pedro-HP:~$

***snippsat*** · May-21-2021, 11:42 AM

(May-21-2021, 11:01 AM)Pedroski55 Wrote: Hi, I don't understand lambda in this post, can you please explain?

You should not type the λ; it's just the default prompt symbol because I use cmder.

wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'
