Posts: 4,647
Threads: 1,494
Joined: Sep 2016
i am getting data as an ordered sequence of buffers, one at a time. an example is reading a data pipe 4096 bytes at a time. i want to count the number of lines in the whole data stream based on the line ending sequence of the platform it is running on (os.sep). when the line ending sequence is longer than 1 character (len(os.sep)>1), a line ending sequence can be split between buffers. does anyone know a good way to accurately count the line endings when getting or accessing these buffers one at a time, without big memory usage (that is, without collecting the whole data sequence all at once)?
an alternate goal is to count lines based on each line ending in any valid line ending (not necessarily the same as other lines).
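One way to approach the first goal is to carry the last len(sep)-1 unmatched bytes from each buffer into the next, so a separator split across a buffer boundary is still seen. The following is only a sketch under assumptions: the function name `count_seps` is made up for illustration, the buffers are `bytes`, and the separator meant is `os.linesep` (encoded), not `os.sep`.

```python
import os

def count_seps(buffers, sep=os.linesep.encode()):
    """Count non-overlapping occurrences of sep across an iterable of
    bytes buffers, holding at most len(sep)-1 extra bytes in memory."""
    if not sep:
        raise ValueError("separator must not be empty")
    keep = len(sep) - 1          # most bytes a partial match can occupy
    tail = b""                   # unmatched bytes carried from earlier buffers
    total = 0
    for buf in buffers:
        data = tail + buf
        pos = 0
        while True:
            hit = data.find(sep, pos)
            if hit < 0:
                break
            total += 1
            pos = hit + len(sep)  # continue after the match just counted
        # Carry over only bytes that come after the last match, and no
        # more than len(sep)-1 of them.
        tail = data[max(pos, len(data) - keep):]
    return total

# "\r\n" split between the first and second buffer is still counted.
print(count_seps([b"a\r", b"\nb\r\n", b"c"], b"\r\n"))  # 2
```

For a pipe, the buffers could come from something like `iter(lambda: sys.stdin.buffer.read(4096), b"")`. Memory use stays at one buffer plus a few carried bytes.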
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
ah, yes, os.sep is the file path separator. oops on me. and := is a new thing i am still unfamiliar with, so this code does not make sense to me. it is unclear to me how this code compares across 2 buffers at the same time.
i am assuming that a linesep is never spread across more than 2 buffers. but if a read can return a buffer of length 1 and a linesep could be longer than 2 characters (not today), that could be a (rare) issue.
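For readers who have not met `:=` yet, here is a small sketch of what it does in a read loop, next to the same loop written without it. The names `chunks_walrus`, `chunks_plain` and `next_carry` are invented for this illustration and are not taken from the code discussed in the thread. `next_carry` also shows why more than 2 buffers need not be a problem: if the carried bytes are rebuilt from (previous carry + new buffer) rather than from the new buffer alone, the carry can span any number of tiny buffers.

```python
import io

def chunks_walrus(stream, size):
    # := assigns the result of read() to buf and also yields that value
    # to the while test, so the loop ends on an empty read.
    while (buf := stream.read(size)):
        yield buf

def chunks_plain(stream, size):
    # The same loop without := (works on Python older than 3.8).
    while True:
        buf = stream.read(size)
        if not buf:
            break
        yield buf

def next_carry(carry, buf, sep_len):
    """Return the last sep_len-1 bytes of everything seen so far."""
    joined = carry + buf
    return joined[max(0, len(joined) - (sep_len - 1)):]

print(list(chunks_walrus(io.BytesIO(b"abcdefg"), 3)))  # [b'abc', b'def', b'g']

# With 1-byte buffers and a 3-byte separator the carry spans two buffers.
carry = b""
for piece in (b"x", b"A", b"B"):
    carry = next_carry(carry, piece, 3)
print(carry)  # b'AB'
```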
such tests would have to simulate os.linesep being 3 characters or more, and very small buffers.
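Such a test could look roughly like the sketch below. Everything in it is an assumption made for illustration: `SEP` is a made-up 3-byte separator (no platform uses it), `count_streamed` is a compact stand-in counter that relies on the separator not overlapping with itself, and `split` cuts the data into fixed-size buffers. The test compares the streamed count against a plain `bytes.count` on the whole data for every buffer size from 1 byte up.

```python
def count_streamed(buffers, sep):
    """Compact tail-carrying counter; valid when sep cannot overlap itself."""
    keep = len(sep) - 1
    tail, total = b"", 0
    for buf in buffers:
        data = tail + buf
        total += data.count(sep)
        # Keep only the last len(sep)-1 bytes for the next round.
        tail = data[max(0, len(data) - keep):]
    return total

def split(data, size):
    """Cut data into buffers of at most size bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

SEP = b"\r\n\x0b"  # hypothetical 3-byte line separator, for testing only
data = b"one" + SEP + b"two" + SEP + SEP + b"three" + SEP
expected = data.count(SEP)  # reference answer from the whole data at once

# Try every buffer size, including 1-byte buffers, where each separator
# is spread over 3 buffers.
for size in range(1, len(data) + 1):
    assert count_streamed(split(data, size), SEP) == expected, size
```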
this doc and code are understandable to me.
i'm now thinking that i may be reading data that has line separators that do not match what the platform normally uses, such as Windows or Mac files transferred in binary to Linux (it happens). the way i dealt with this back in my days of C programming was to allow any mix of CR LF VT FF, in as many as 4 bytes, to mean a new line (plus whatever else based on what is there). but where any of them repeats, it means another line (so, in "foo\r\n\r\vbar", foo and bar would be separated by at least one blank line). it might be a little more complicated to process but should cover almost all real life cases.
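The rule described above can be sketched as a small state machine that remembers which of CR, LF, VT, FF it has seen in the current run. This is one reading of the rule, not a port of the original C code: a run of distinct control bytes counts as one line ending, a repeated byte starts a new one, and any other byte resets the run. The name `count_mixed_endings` is made up, and the "plus whatever else" part is left out. Because the state lives outside the buffer loop, it does not matter where the buffers are cut.

```python
EOL_BYTES = frozenset(b"\r\n\x0b\x0c")  # CR, LF, VT, FF as integer byte values

def count_mixed_endings(buffers):
    """Count line endings where any mix of distinct CR/LF/VT/FF bytes is
    one ending and a repeated byte begins the next ending."""
    seen = set()   # control bytes already used by the current ending
    total = 0
    for buf in buffers:
        for byte in buf:
            if byte not in EOL_BYTES:
                seen.clear()          # ordinary data ends the current run
            elif not seen or byte in seen:
                total += 1            # first byte of a new line ending
                seen = {byte}
            else:
                seen.add(byte)        # still part of the same ending
    return total

# "\r\n" is one ending; the second "\r" repeats, so "\r\v" is another.
print(count_mixed_endings([b"foo\r\n\r\x0bbar"]))           # 2
print(count_mixed_endings([b"foo\r", b"\n\r", b"\x0bbar"]))  # 2
```

The per-byte loop is slow in pure Python; it is meant to show the logic, not to be fast.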