all i want to do is count the lines in each file but there are strange binary bytes or older ISO codes.
Output:
Traceback (most recent call last):
  File "last-edited.py", line 38, in <module>
    pf(t,n)
  File "last-edited.py", line 7, in pf
    c = len([x for x in f])
  File "last-edited.py", line 7, in <listcomp>
    c = len([x for x in f])
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 93: invalid start byte
from sfc import *
def pf(t,n):
    if os.path.exists(n):
        with open(n) as f:
            c = len([x for x in f])
        print(t,str(c).rjust(8),n)
fn = '.edit_log'
argv.pop(0)
cd() # be in home directory
with open(fn) as el:
    nt = {}
    for ee in el:
        t,e,n = ee.strip().split()[:3]
        if n in nt:
            nt[n].append(t)
        else:
            nt[n] = [t]
ns = sorted(nt.keys())
tn = {}
for n in ns:
    nt[n].sort()
for n in ns:
    t = nt[n][-1]
    if t[13]=='-':
        t=t[0:7]+t[8:13]+t[14:16]+t[17:]
    tn[t] = n
for t,n in sorted([x for x in sorted(tn.items())]):
    if argv:
        for a in argv:
            if n.endswith(a):
                if os.path.exists(n):
                    pf(t,n)
                break
    else: # none requested so print all
        if os.path.exists(n):
            pf(t,n)
i think the only things it uses from sfc are os and cd (which, called with no args, changes directory to the user's home directory).
maybe it will be simpler to read most of the file in binary and count the b'\n' in the bytes string i get. really big files i don't need the exact number, just a general size.
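The binary idea could be sketched roughly like this. The name count_newlines and the 64 KiB chunk size are my own choices for illustration, not something from the script above. Because the file is opened in binary mode nothing is decoded, so a stray 0xa9 byte cannot raise UnicodeDecodeError.

```python
def count_newlines(path, chunk_size=65536):
    """Count newline bytes in a file without decoding it."""
    total = 0
    with open(path, 'rb') as f:
        # Read fixed-size chunks so a large file is never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            total += chunk.count(b'\n')
    return total
```

Like wc -l, this counts newline characters, so a last line without a trailing newline is not counted.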
There are some strange encodings, and the easiest way to deal with them is to find the proper encoding.
if it's a strange encoding i don't care if this is accurate. this is an operation on my edit log, which has the time of each edit and a file path. this script lists recently edited files with their times. i'm inserting a column for the number of lines. for my Python source files and most other text and source, the encoding will be ASCII or UTF-8. the file causing the problem was a C file from 2006, when i was using ISO8859 for the copyright symbol. cat that line today (the Linux kernel is doing UTF-8 reasonably well) and the copyright is just a question mark in an inverted cell.
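For what it is worth, the byte in the traceback fits that description: 0xa9 is the copyright sign in ISO-8859-1, but on its own it is not valid UTF-8. A small sketch, using a made-up sample line (the raw bytes here are hypothetical, only the 0xa9 comes from the traceback):

```python
raw = b'Copyright \xa9 2006'   # hypothetical line containing the byte from the traceback

# ISO-8859-1 (Latin-1) maps every byte to a character, so this never fails.
text = raw.decode('iso-8859-1')
print(text)

# In UTF-8, 0xa9 is a continuation byte and cannot start a character.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)
```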
i've already done this and have it showing ">9999" (if the number of lines exceeds 9999) and narrowed it to 5 characters. if the file exceeds 1048575 bytes then it prints ">####". i do f.read(1048576). i could reduce that.
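A guess at how that column might be produced, not the actual code from the script. The names line_column, LIMIT_BYTES and MAX_SHOWN are invented here; the thresholds are the ones described above.

```python
LIMIT_BYTES = 1048576   # read at most this many bytes
MAX_SHOWN = 9999        # largest line count that fits the column

def line_column(path):
    """Return a 5-character line-count field for path."""
    with open(path, 'rb') as f:
        data = f.read(LIMIT_BYTES)
    # A full read means the file is larger than 1048575 bytes.
    if len(data) >= LIMIT_BYTES:
        return '>####'
    count = data.count(b'\n')
    if count > MAX_SHOWN:
        return '>9999'
    return str(count).rjust(5)
```

Lowering LIMIT_BYTES would only change where the ">####" cutoff falls.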
If I need a quick look at the number of lines in a file, I use wc in the terminal:
cat my_filename | wc -l
It is easy to use from Python with subprocess. A specific example on a file named shakespeare.txt; check_output returns bytes, so I convert the result to an integer:
>>> import subprocess
>>> source = subprocess.Popen(['cat', 'shakespeare.txt'], stdout=subprocess.PIPE)
>>> lines = int(subprocess.check_output(['wc', '-l'], stdin=source.stdout))
>>> lines
4155
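The same thing can probably be done without the cat pipe by handing the file name straight to wc. This is only a sketch: wc_lines is a made-up helper name, and it assumes a wc program is on the PATH.

```python
import subprocess

def wc_lines(path):
    """Return the newline count that `wc -l` reports for path."""
    # With a file argument wc prints "<count> <name>", so keep the first field.
    result = subprocess.run(['wc', '-l', '--', path],
                            stdout=subprocess.PIPE, check=True)
    return int(result.stdout.split()[0])
```

Since wc works on bytes, it does not care what encoding the file is in.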
An alternative with more_itertools.ilen():
λ cat paillasse/sometest.py | wc -l
60
λ python
...
>>> from more_itertools import ilen
>>> with open('paillasse/sometest.py') as ifh:
...     print(ilen(ifh))
...
60
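If installing more_itertools is not an option, counting the items of the file iterator with the standard library should give the same number. A sketch (iter_line_count is a name made up for this example):

```python
def iter_line_count(path):
    """Count lines by iterating over the file object."""
    with open(path) as fh:
        # Each iteration yields one line, so summing 1 per item counts them.
        return sum(1 for _ in fh)
```

Note that this still decodes the file as text, so it can hit the same UnicodeDecodeError, and it counts a final line that has no trailing newline, which wc -l does not.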
(May-18-2021, 11:42 PM) Skaperen wrote: if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path.
Can just ignore encoding errors,there is a parameter for this
errors="ignore"
or
errors='replace'
(will be
?
).
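A quick way to see the difference between the two settings, using a throwaway temp file whose contents are invented for the demonstration:

```python
import os
import tempfile

# Hypothetical sample: two lines, the first containing the ISO-8859-1 byte 0xa9.
fd, path = tempfile.mkstemp()
os.write(fd, b'/* \xa9 2006 */\nint x;\n')
os.close(fd)

with open(path, encoding='utf-8', errors='ignore') as f:
    ignored = f.read()      # the bad byte is silently dropped
with open(path, encoding='utf-8', errors='replace') as f:
    replaced = f.read()     # the bad byte becomes U+FFFD

os.remove(path)
print(repr(ignored))
print(repr(replaced))
print(ignored.count('\n'), replaced.count('\n'))
```

Either way the newlines survive, so the line count is not affected by the bad byte.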
So I can show a version of this; I just copied my own code from this Thread and made a small change.
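import os

def find_files(file_type, path):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def count_lines(files):
    for file in files:
        with open(file, encoding='utf-8', errors="ignore") as f:
            # Initialise so an empty file gives 0 instead of an unbound
            # (or stale) line_nr. Because enumerate also starts at -1, the
            # result is one less than the number of lines iterated, which
            # equals wc -l only when the last line has no trailing newline.
            line_nr = -1
            for line_nr, _ in enumerate(f, -1):
                pass
        yield file, line_nr + 1

if __name__ == '__main__':
    path = r'E:\div_code'
    file_type = '.txt'
    files = find_files(file_type, path)
    line_count = count_lines(files)
    print(list(line_count))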
Output:
[('alice_in_wonderland.txt', 3599), ('test.txt', 1807), ('W2Testfile.txt', 1396)]
Compare with wc:
λ wc -l *.txt
1396 W2Testfile.txt
3599 alice_in_wonderland.txt
1807 test.txt
6802 total
Hi, I don't understand lambda in this post, can you please explain?
I tried:
Quote:λ wc -l /home/pedro/summer2021/19BE/scansforOCR/*.text
in bash and just got:
Quote:pedro@pedro-HP:~$ λ wc /home/pedro/summer2021/19BE/scansforOCR/*.text
λ: command not found
pedro@pedro-HP:~$
(May-21-2021, 11:01 AM) Pedroski55 wrote: Hi, I don't understand lambda in this post, can you please explain?
You should not type the λ; it is just the default prompt sign in cmder, which I use.
wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  A word is a non-zero-length sequence of
characters delimited by white space.

With no FILE, or when FILE is -, read standard input.

The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the maximum display width
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'