Python Forum
all i want to do is count the lines in each file
#1
all i want to do is count the lines in each file, but some files contain strange binary bytes or older ISO 8859 codes and the read fails with a UnicodeDecodeError.

Output:
Traceback (most recent call last):
  File "last-edited.py", line 38, in <module>
    pf(t,n)
  File "last-edited.py", line 7, in pf
    c = len([x for x in f])
  File "last-edited.py", line 7, in <listcomp>
    c = len([x for x in f])
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 93: invalid start byte
from sfc import *
def pf(t,n):
    if os.path.exists(n):
        with open(n) as f:
            c = len([x for x in f])
        print(t,str(c).rjust(8),n)
fn = '.edit_log'
argv.pop(0)
cd() # be in home directory
with open(fn) as el:
    nt = {}
    for ee in el:
        t,e,n = ee.strip().split()[:3]
        if n in nt:
            nt[n].append(t)
        else:
            nt[n] = [t]
ns = sorted(nt.keys())
tn = {}
for n in ns:
    nt[n].sort()
for n in ns:
    t = nt[n][-1]
    if t[13]=='-':
        t=t[0:7]+t[8:13]+t[14:16]+t[17:]
    tn[t] = n
for t,n in sorted([x for x in sorted(tn.items())]):
    if argv:
        for a in argv:
            if n.endswith(a):
                if os.path.exists(n):
                    pf(t,n)
                break
    else: # none requested so print all
        if os.path.exists(n):
            pf(t,n)
i think the only things it uses from sfc are os and cd (called with no args, cd changes directory to the user's home directory).
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
you need to know the file encoding.
Though not foolproof, you can usually find it with chardet: https://pypi.org/project/chardet/
#3
maybe it will be simpler to read most of the file in binary and count the b'\n' in the bytes string i get. really big files i don't need the exact number, just a general size.
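
A minimal sketch of that idea, assuming the file fits the "count the b'\n' bytes" approach described above. The helper name count_newlines and the chunk size are hypothetical choices, not taken from the original script. Because the file is opened in binary mode, no codec runs and a byte like 0xa9 cannot raise UnicodeDecodeError.

```python
import os
import tempfile

def count_newlines(path, chunk_size=65536):
    """Count newline bytes in a file without decoding it."""
    total = 0
    with open(path, "rb") as f:           # binary mode: no codec is involved
        while True:
            chunk = f.read(chunk_size)
            if not chunk:                 # empty bytes object means end of file
                break
            total += chunk.count(b"\n")
    return total

# Demo on a throwaway file containing the 0xa9 byte from the traceback.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"/* \xa9 2006 */\nint x;\nint y;\n")
print(count_newlines(tmp.name))
os.remove(tmp.name)
```

Like wc -l, this counts newline characters, so a final line with no trailing newline is not included in the total.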
#4
There are some strange encodings, and the easiest way to deal with them is to find the proper encoding.
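
One way to look for a workable encoding with only the standard library is to try a short list of candidates in order. This is a sketch under the assumption that the files are either UTF-8 or a single-byte ISO 8859 variant; count_lines_guessing and the candidate list are hypothetical, and since latin-1 accepts every byte value it acts as a catch-all rather than a real detection.

```python
import os
import tempfile

def count_lines_guessing(path, encodings=("utf-8", "latin-1")):
    """Return (line_count, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return sum(1 for _ in f), enc
        except UnicodeDecodeError:
            continue                      # wrong guess, try the next candidate
    raise ValueError("no candidate encoding could decode " + path)

# Demo: 0xa9 is the copyright sign in ISO 8859-1 but invalid as a UTF-8 start byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"/* \xa9 2006 */\nint x;\n")
print(count_lines_guessing(tmp.name))
os.remove(tmp.name)
```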
#5
if it's a strange encoding i don't care if this is accurate. this is an operation on my edit log, which has the time of each edit and a file path. this script lists recently edited files with their times. i'm inserting a column for the number of lines. for my Python source files and most other text and source, the encoding will be ASCII or UTF-8. the file causing the problem was a C file from 2006, when i was using ISO 8859 for the copyright symbol. cat that line today (the Linux kernel is doing UTF-8 reasonably well) and the copyright symbol is just a question mark in an inverted cell.

i've already done this and have it showing ">9999" (if the number of lines exceeds 9999) and narrowed it to 5 characters. if the file exceeds 1048575 bytes then it prints ">####". i do f.read(1048576). i could reduce that.
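
The capped approach described above might look roughly like the following. This is only a sketch of the stated behaviour (read at most 1048576 bytes, show ">####" for files over 1048575 bytes, show ">9999" past 9999 lines, otherwise a 5-character column); the function name line_column and the constant names are hypothetical and not from the actual script.

```python
import os
import tempfile

MAX_BYTES = 1048576    # single read size mentioned in the post
MAX_LINES = 9999       # largest count that still fits the 5-character column

def line_column(path):
    """Return a 5-character line-count column for one file."""
    with open(path, "rb") as f:
        data = f.read(MAX_BYTES)
    if len(data) >= MAX_BYTES:            # file exceeds 1048575 bytes
        return ">####"
    count = data.count(b"\n")
    if count > MAX_LINES:
        return ">9999"
    return str(count).rjust(5)

# Demo on a small throwaway file with three lines.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\nb\nc\n")
print(repr(line_column(tmp.name)))
os.remove(tmp.name)
```

Reading a fixed maximum keeps the cost per file bounded, at the price of reporting no count at all for large files.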
#6
If I need a quick look at the number of lines in a file, I use wc in the terminal.

cat my_filename | wc -l
It is easy to use from Python with subprocess. A specific example on a file named shakespeare.txt; check_output returns bytes, so I converted the result to an integer:

>>> import subprocess
>>> source = subprocess.Popen(['cat', 'shakespeare.txt'], stdout=subprocess.PIPE)   
>>> lines = int(subprocess.check_output(['wc', '-l'], stdin=source.stdout))     
>>> lines
4155
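
The same count can also be obtained without the cat pipeline, since wc accepts a file name directly. A sketch, assuming a wc binary is available on PATH (so not on a stock Windows shell); wc_lines is a hypothetical helper name.

```python
import os
import subprocess
import tempfile

def wc_lines(path):
    """Return the newline count that `wc -l` reports for one file."""
    # wc opens the file itself, so no separate cat process is needed.
    result = subprocess.run(["wc", "-l", path], stdout=subprocess.PIPE, check=True)
    # wc prints "<count> <path>"; the count is the first whitespace-separated field.
    return int(result.stdout.split()[0])

# Demo on a throwaway file; the undecodable 0xa9 byte does not bother wc.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"/* \xa9 2006 */\nint x;\n")
print(wc_lines(tmp.name))
os.remove(tmp.name)
```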
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
#7
An alternative with more_itertools.ilen()
λ cat paillasse/sometest.py | wc -l
60
λ python
...
>>> from more_itertools import ilen
>>> with open('paillasse/sometest.py') as ifh:
...     print(ilen(ifh))
... 
60
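
If installing more_itertools is not an option, a similar count can be had from the standard library by consuming the file iterator with sum(). This is a sketch, not a drop-in replacement for ilen; count_items is a hypothetical name. Note that, unlike wc -l, iterating over a file also counts a final line that has no trailing newline.

```python
import os
import tempfile

def count_items(iterable):
    """Consume an iterable and return how many items it produced."""
    return sum(1 for _ in iterable)

# Demo: iterating over a text file yields one item per line.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("first\nsecond\nthird")     # no newline after the last line
with open(tmp.name) as ifh:
    print(count_items(ifh))
os.remove(tmp.name)
```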
#8
(May-18-2021, 11:42 PM)Skaperen Wrote: if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path.
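You can just ignore encoding errors: open() has an errors parameter for this. errors="ignore" drops the bytes that cannot be decoded, and errors='replace' turns them into the U+FFFD replacement character (usually displayed as a question mark symbol).
To show this, I can copy my own code from this Thread and make a small change.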
import os

def find_files(file_type, path):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def count_lines(files):
    for file in files:
        with open(file, encoding='utf-8', errors="ignore") as f:
            for line_nr, _ in enumerate(f, -1):
                pass
        yield file, line_nr + 1

if __name__ == '__main__':
    path = r'E:\div_code'
    file_type = '.txt'
    files = find_files(file_type, path)
    line_count = count_lines(files)
    print(list(line_count))
Output:
[('alice_in_wonderland.txt', 3599), ('test.txt', 1807), ('W2Testfile.txt', 1396)]
Compare with wc
λ wc -l *.txt
  1396 W2Testfile.txt
  3599 alice_in_wonderland.txt
  1807 test.txt
  6802 total
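
To see what the errors parameter does to the byte from the original traceback, here is a small sketch using errors="replace". The helper name read_lines_lenient and the sample file content are hypothetical; the point is only that the undecodable 0xa9 byte no longer raises and the number of lines is unaffected.

```python
import os
import tempfile

def read_lines_lenient(path):
    """Read a file as UTF-8, substituting U+FFFD for undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return list(f)

# Demo: a C-style header saved with an ISO 8859-1 copyright sign (byte 0xa9).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"/* \xa9 2006 */\nint x;\n")
lines = read_lines_lenient(tmp.name)
print(len(lines))                  # number of lines read
print("\ufffd" in lines[0])        # True when the bad byte was substituted
os.remove(tmp.name)
```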
#9
Hi, I don't understand lambda in this post, can you please explain?

I tried:

Quote:λ wc -l /home/pedro/summer2021/19BE/scansforOCR/*.text

in bash and just got:

Quote:pedro@pedro-HP:~$ λ wc /home/pedro/summer2021/19BE/scansforOCR/*.text
λ: command not found
pedro@pedro-HP:~$
Reply
#10
(May-21-2021, 11:01 AM)Pedroski55 Wrote: Hi, I don't understand lambda in this post, can you please explain?
Don't type the λ; it's just the default prompt symbol in cmder, the terminal I use.
wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'