How to read in multiple files efficiently
Thread on Python Forum (https://python-forum.io), General Coding Help: /thread-39288.html
How to read in multiple files efficiently - garynewport - Jan-26-2023

I have multiple files, all containing 10s of thousands of lines of data. I cannot alter the way the data is presented and it appears as seen in the attached screenshot. The first 2 lines can be ignored; I am only interested in the lines beginning with "hydro".

Currently, my program reads the data in from each file selected as and when I select a graph...

    for line in file:
        if keyword.casefold() in line.casefold():
            time = self.convert(line[21:35])        # Timestep
            delta_t = self.convert(line[36:50])     # Change in timestep
            mass = self.convert(line[51:65])        # Mass of star (excluding envelope)
            radius = self.convert(line[66:80])      # Radius of star (excluding envelope)
            lum_core = self.convert(line[81:95])    # Luminosity of star (excluding envelope)
            lum_tot = self.convert(line[96:110])    # Total luminosity (including enveloping cloud)
            flux = self.convert(line[111:125])      # Mass flux
            ratio = float(line[125:137])            # Ratio of star mass against mass of the Sun

            # Store the data into the Numpy array 'data'
            data = np.append(data, np.array([[time, delta_t, mass, radius, lum_core, lum_tot, flux, ratio]]), axis = 0)

The problem here is that this takes some time (several minutes) and is required each time I create a new graph based upon this data.

I did consider reading the data into a list as and when each file was selected, but this causes a delay between selecting a file and being able to select the next file, as the data is read into the list. I thought of reading in on a separate thread, but then how do I ensure that the data added to the list is in the correct order?

I would appreciate any and all suggestions on how I might approach this problem.
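One thing worth checking in the loop above: np.append does not grow an array in place. It returns a new copy on every call, so appending row by row gets slower as the array grows. The thread does not confirm that this is the main bottleneck, but a common alternative is to collect the rows in a plain list and build the array once. The sketch below assumes the slice positions from the post; the names parse_lines, KEYWORD, CONVERTED_SLICES and RATIO_SLICE are illustrative, and convert=float is only a stand-in for the poster's self.convert, which is not shown in the thread.

```python
import numpy as np

KEYWORD = "hydro"  # lines of interest contain this word (from the post)

# (start, end) positions copied from the post; these fields go through convert()
CONVERTED_SLICES = [(21, 35), (36, 50), (51, 65), (66, 80),
                    (81, 95), (96, 110), (111, 125)]
RATIO_SLICE = (125, 137)  # the last field is parsed with float() in the post
N_COLUMNS = len(CONVERTED_SLICES) + 1


def parse_lines(lines, convert=float):
    """Parse matching lines into a 2-D array, allocating the array only once.

    convert is a placeholder (assumption) for the poster's self.convert.
    """
    rows = []
    for line in lines:
        if KEYWORD.casefold() in line.casefold():
            row = [convert(line[start:end]) for start, end in CONVERTED_SLICES]
            row.append(float(line[RATIO_SLICE[0]:RATIO_SLICE[1]]))
            rows.append(row)
    # One conversion at the end instead of one array copy per row;
    # reshape keeps a (0, 8) shape when no line matched
    return np.array(rows, dtype=float).reshape(-1, N_COLUMNS)
```

Any iterable of strings works as input, so an open file object can be passed directly, for example data = parse_lines(file) inside a with open(...) block.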
RE: How to read in multiple files efficiently - DeaD_EyE - Jan-26-2023

Maybe fileinput is something for you.
https://docs.python.org/3/library/fileinput.html

    import fileinput
    from pathlib import Path

    # example text files
    files = list(Path().glob("*.txt"))

    # note: the encoding argument requires Python 3.10 or newer
    with fileinput.input(files, encoding="utf8") as file_input:
        for line in file_input:
            current_filename = file_input.filename()
            current_file_line = file_input.filelineno()
            global_lineno = file_input.lineno()

            # skipping line 1 and 2
            if current_file_line < 3:
                continue

            # next file e.g. at line 10
            if current_file_line == 10:
                file_input.nextfile()
                # maybe you want to continue here,
                # if line 10 should not be processed

            print(f"{current_filename} - {current_file_line} - {global_lineno}")
            # print(line, end="")

RE: How to read in multiple files efficiently - garynewport - Jan-27-2023

Thank you. To be honest, the routine for reading the files in works; it just works slowly. I want to shift the reading in to a single event, where all data is read into a single list. The problem is, if I do that as the files are selected, then the delay between reading one file in and the opportunity to select the next one is too great.

RE: How to read in multiple files efficiently - DeaD_EyE - Jan-27-2023

If your data has a fixed-width format, your approach is OK. You can refactor the line parsing into a new function. You could also create a data structure where the function/method, start and end are defined for each field.
    import numpy as np


    # METHOD
    def convert_fields1(self, line: str) -> dict:
        FIELDS = {
            "time": (self.convert, 21, 35),
            "delta_t": (self.convert, 36, 50),
            "mass": (self.convert, 51, 65),
            "radius": (self.convert, 66, 80),
            "lum_core": (self.convert, 81, 95),
            "lum_tot": (self.convert, 96, 110),
            "flux": (self.convert, 111, 125),
            "ratio": (float, 125, 137),
        }
        return {
            field: convert(line[start:end])
            for field, (convert, start, end) in FIELDS.items()
        }


    # METHOD
    def convert_fields2(self, line: str) -> list:
        """
        Method uses self.convert() and float() to convert
        the data-types for each field of the line.
        """
        # Could also be defined as a class variable
        # (method/function, start, end)
        FIELDS = (
            (self.convert, 21, 35),
            (self.convert, 36, 50),
            (self.convert, 51, 65),
            (self.convert, 66, 80),
            (self.convert, 81, 95),
            (self.convert, 96, 110),
            (self.convert, 111, 125),
            (float, 125, 137),
        )
        return [convert(line[start:end]) for convert, start, end in FIELDS]


    ### Placeholder for self.convert ###
    # instead you use your own implementation
    class Converter:
        @staticmethod
        def convert(value):
            return float(value)


    converter = Converter()
    #####################################

    # a line to check if the boundaries are right
    line_test = "--------------------111111111111111/22222222222222/33333333333333/44444444444444/55555555555555/66666666666666/777777777777771.2345678912@@@@@@@@"

    # returns a dict
    # the first argument 'converter' is to fake your class
    print(convert_fields1(converter, line_test))

    # returns a list, field names are "unknown"
    print(convert_fields2(converter, line_test))

    # I guess all datatypes are floats, because an np.array can
    # keep only data with the same datatype

    # Dict version
    # using the dict requires extracting the values, which is less
    # efficient: creating a dict takes longer than creating a list,
    # because all keys must be hashed.
    # list() is needed: np.array() on the bare dict_values view
    # would give a 0-d object array instead of an array of floats
    new_data1 = np.array(list(convert_fields1(converter, line_test).values()))
    print(new_data1)

    # List version
    # here .values() is not required, because the values
    # are already in a list
    new_data2 = np.array(convert_fields2(converter, line_test))
    print(new_data2)
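On the original question about reading on a separate thread while keeping the data in the correct order, here is a minimal sketch using concurrent.futures from the standard library. The idea is to start loading a file the moment it is selected and to keep the resulting Future objects in selection order, so the order of the combined data never depends on which thread finishes first. DataCache, load_file and the parse_line argument are hypothetical names for illustration; parse_line would be something like convert_fields2 bound to your object. In CPython, threads will probably not make the parsing itself faster, but they should keep the file selection responsive, and each file is only read once no matter how many graphs are drawn afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_file(path, parse_line, keyword="hydro"):
    """Read one file and return its parsed rows (hypothetical helper)."""
    rows = []
    with open(path, encoding="utf8") as handle:
        for line in handle:
            if keyword.casefold() in line.casefold():
                rows.append(parse_line(line))
    return rows


class DataCache:
    """Loads each selected file once, in the background (illustrative)."""

    def __init__(self, parse_line, max_workers=4):
        self._parse_line = parse_line
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # path -> Future; a dict keeps insertion order, i.e. selection order
        self._futures = {}

    def select(self, path):
        """Call when the user picks a file; returns immediately."""
        path = Path(path)
        if path not in self._futures:
            self._futures[path] = self._pool.submit(
                load_file, path, self._parse_line
            )

    def rows(self):
        """All rows, file by file in selection order.

        Blocks only for files whose background load has not finished yet.
        """
        combined = []
        for future in self._futures.values():
            combined.extend(future.result())
        return combined

    def close(self):
        """Wait for running loads and release the worker threads."""
        self._pool.shutdown(wait=True)
```

Calling select(path) from the file-selection handler and rows() from the graph code would then replace the per-graph file reading; the result of rows() can be passed to np.array() once if an array is needed.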