Posts: 4,646
Threads: 1,493
Joined: Sep 2016
all i want to do is count the lines in each file but there are strange binary bytes or older ISO codes.
Output:
Traceback (most recent call last):
  File "last-edited.py", line 38, in <module>
    pf(t,n)
  File "last-edited.py", line 7, in pf
    c = len([x for x in f])
  File "last-edited.py", line 7, in <listcomp>
    c = len([x for x in f])
  File "/usr/host/bin/../../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 93: invalid start byte
from sfc import *

def pf(t,n):
    if os.path.exists(n):
        with open(n) as f:
            c = len([x for x in f])
        print(t,str(c).rjust(8),n)

fn = '.edit_log'
argv.pop(0)
cd() # be in home directory
with open(fn) as el:
    nt = {}
    for ee in el:
        t,e,n = ee.strip().split()[:3]
        if n in nt:
            nt[n].append(t)
        else:
            nt[n] = [t]
ns = sorted(nt.keys())
tn = {}
for n in ns:
    nt[n].sort()
for n in ns:
    t = nt[n][-1]
    if t[13]=='-':
        t=t[0:7]+t[8:13]+t[14:16]+t[17:]
    tn[t] = n
for t,n in sorted([x for x in sorted(tn.items())]):
    if argv:
        for a in argv:
            if n.endswith(a):
                if os.path.exists(n):
                    pf(t,n)
                break
    else: # none requested so print all
        if os.path.exists(n):
            pf(t,n)

i think the only things it uses from sfc are os and cd (called with no args, it changes directory to the user's home directory).
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 12,022
Threads: 484
Joined: Sep 2016
you need to know the file encoding.
Though not foolproof, you can usually find it with chardet: https://pypi.org/project/chardet/
Posts: 4,646
Threads: 1,493
Joined: Sep 2016
maybe it will be simpler to read most of the file in binary and count the b'\n' in the bytes string i get. for really big files i don't need the exact number, just a general size.
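A minimal sketch of that binary-counting idea, using only the standard library. The function name count_newlines and the 1 MiB chunk size are assumptions for illustration, not taken from the script above. Because the file is never decoded, a stray byte such as 0xa9 cannot raise UnicodeDecodeError.

```python
def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file without decoding it."""
    count = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            count += chunk.count(b"\n")
    return count
```

This counts newline characters, the same thing wc -l reports, so a final line without a trailing newline is not counted.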
Posts: 12,022
Threads: 484
Joined: Sep 2016
There are some strange encodings, and the easiest way to deal with them is to find the proper encoding.
Posts: 4,646
Threads: 1,493
Joined: Sep 2016
May-18-2021, 11:42 PM
(This post was last modified: May-18-2021, 11:43 PM by Skaperen.)
if it's a strange encoding i don't care if this is accurate. this operates on my edit log, which has the time of each edit and a file path. this script lists recently edited files with the time. i'm inserting a column for the number of lines. for my Python source files and most other text and source, the encoding will be ASCII or UTF-8. the file causing the problem was a C file from 2006, when i was using ISO8859 for the copyright symbol. cat that line today (the Linux kernel is doing UTF-8 reasonably well) and the copyright symbol is just a question mark in an inverted cell.
i've already done this and have it showing ">9999" (if the number of lines exceeds 9999) and narrowed the column to 5 characters. if the file exceeds 1048575 bytes then it prints ">####". i do f.read(1048576). i could reduce that.
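The actual code for this was not posted, so the following is only a guess at how such a capped column could look. The name line_column and the limit parameter are hypothetical; the thresholds (9999 lines, a 1048576-byte read) are the ones described above.

```python
LIMIT = 1048576  # read at most this many bytes, as described in the post

def line_column(path, limit=LIMIT):
    """Return a 5-character line-count column for one file."""
    with open(path, "rb") as f:
        data = f.read(limit)
    if len(data) >= limit:
        # The read filled the whole buffer, so the file is at least
        # `limit` bytes long and the count would be incomplete.
        return ">####"
    n = data.count(b"\n")
    if n > 9999:
        return ">9999"
    return str(n).rjust(5)
```

Reading in binary sidesteps the decode error entirely, and the single bounded read keeps the cost per file predictable.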
Posts: 1,950
Threads: 8
Joined: Jun 2018
If I need a quick look at the number of lines in a file, I use wc in the terminal.
cat my_filename | wc -l
It is easy to use from Python with subprocess. A specific example on a file named shakespeare.txt; check_output returns bytes, so I converted the result to an integer:
>>> import subprocess
>>> source = subprocess.Popen(['cat', 'shakespeare.txt'], stdout=subprocess.PIPE)
>>> lines = int(subprocess.check_output(['wc', '-l'], stdin=source.stdout))
>>> lines
4155
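Since wc accepts file names as arguments, the cat process can probably be dropped. A sketch under that assumption, where wc_lines is a hypothetical helper name and a wc binary is assumed to be on the PATH:

```python
import subprocess

def wc_lines(path):
    """Return the newline count that `wc -l` reports for one file."""
    result = subprocess.run(
        ["wc", "-l", "--", path],   # "--" guards against names starting with "-"
        stdout=subprocess.PIPE,
        check=True,                 # raise CalledProcessError if wc fails
    )
    # wc prints "<count> <path>"; the first field is the count.
    return int(result.stdout.split()[0])
```

Because wc works on raw bytes, it is not bothered by the encoding of the file.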
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 4,783
Threads: 76
Joined: Jan 2018
May-19-2021, 08:58 AM
(This post was last modified: May-19-2021, 08:59 AM by Gribouillis.)
An alternative with more_itertools.ilen()
λ cat paillasse/sometest.py | wc -l
60
λ python
...
>>> from more_itertools import ilen
>>> with open('paillasse/sometest.py') as ifh:
... print(ilen(ifh))
...
60
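If installing more_itertools is not wanted, a generator expression with sum() should give the same kind of count with only the standard library. This is a sketch, and count_lines_stdlib is a made-up name. Opening the file in binary mode is my own addition here, so that bytes that are not valid UTF-8 cannot raise UnicodeDecodeError.

```python
def count_lines_stdlib(path):
    """Count lines by iterating the file, without building a list."""
    with open(path, "rb") as fh:
        # Iterating a binary file yields one bytes object per line.
        return sum(1 for _ in fh)
```

Unlike wc -l, this counts a final line that has no trailing newline.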
Posts: 7,312
Threads: 123
Joined: Sep 2016
May-19-2021, 12:02 PM
(This post was last modified: May-19-2021, 12:02 PM by snippsat.)
(May-18-2021, 11:42 PM)Skaperen Wrote: if it's a strange encoding i don't care if this is accurate. this is operation on my edit log which has time of edit and a file path.
You can just ignore encoding errors; open() has a parameter for this: errors="ignore" drops the undecodable bytes, and errors='replace' substitutes the replacement character U+FFFD (often displayed as ?).
To show this, I can copy my own code from this Thread and make a small change.
import os

def find_files(file_type, path):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def count_lines(files):
    for file in files:
        line_nr = -1  # so an empty file yields 0 instead of raising NameError
        with open(file, encoding='utf-8', errors="ignore") as f:
            # Starting at -1 gives one less than the number of lines read,
            # which equals wc -l when the last line has no trailing newline.
            for line_nr, _ in enumerate(f, -1):
                pass
        yield file, line_nr + 1

if __name__ == '__main__':
    path = r'E:\div_code'
    file_type = '.txt'
    files = find_files(file_type, path)
    line_count = count_lines(files)
    print(list(line_count))

Output:
[('alice_in_wonderland.txt', 3599), ('test.txt', 1807), ('W2Testfile.txt', 1396)]
Compare with wc
λ wc -l *.txt
1396 W2Testfile.txt
3599 alice_in_wonderland.txt
1807 test.txt
6802 total
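To make the effect of the errors parameter concrete, here is a small sketch on a made-up line containing the same 0xa9 byte as in the traceback. The sample bytes are hypothetical; the decoding behaviour is that of the standard codecs.

```python
# Hypothetical sample line; 0xa9 is the copyright sign in ISO-8859-1.
raw = b"/* Copyright \xa9 2006 */\n"

try:
    raw.decode("utf-8")                       # default is errors="strict"
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
latin1 = raw.decode("iso-8859-1")                 # bad byte becomes U+00A9

# The newline survives in every variant, so the line count is unaffected.
for text in (ignored, replaced, latin1):
    print(repr(text), text.count("\n"))
```

For a plain line count, any of the three non-strict variants should do; only decoding with the right encoding recovers the original character.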
Posts: 1,090
Threads: 143
Joined: Jul 2017
May-21-2021, 11:01 AM
(This post was last modified: May-21-2021, 11:02 AM by Pedroski55.)
Hi, I don't understand lambda in this post, can you please explain?
I tried:
Quote:λ wc -l /home/pedro/summer2021/19BE/scansforOCR/*.text
in bash and just got:
Quote:pedro@pedro-HP:~$ λ wc /home/pedro/summer2021/19BE/scansforOCR/*.text
λ: command not found
pedro@pedro-HP:~$
Posts: 7,312
Threads: 123
Joined: Sep 2016
(May-21-2021, 11:01 AM)Pedroski55 Wrote: Hi, I don't understand lambda in this post, can you please explain?
Don't type the λ; it is not part of the command. It's just the default prompt character, because I use cmder.
wc --help
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. A word is a non-zero-length sequence of
characters delimited by white space.
With no FILE, or when FILE is -, read standard input.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the maximum display width
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'