How to read in multiple files efficiently

How to read in multiple files efficiently - garynewport - Jan-26-2023

I have multiple files, each containing tens of thousands of lines of data. I cannot alter the way the data is presented; it appears as shown in the attached screenshot.

The first 2 lines can be ignored; I am only interested in the lines beginning with "hydro".

Currently, my program reads the data from each selected file whenever I select a graph:

for line in file:
    if keyword.casefold() in line.casefold():
        time     = self.convert(line[21:35])      # Timestep
        delta_t  = self.convert(line[36:50])      # Change in timestep
        mass     = self.convert(line[51:65])      # Mass of star (excluding envelope)
        radius   = self.convert(line[66:80])      # Radius of star (excluding envelope)
        lum_core = self.convert(line[81:95])      # Luminosity of star (excluding envelope)
        lum_tot  = self.convert(line[96:110])     # Total luminosity (including enveloping cloud)
        flux     = self.convert(line[111:125])    # Mass flux
        ratio    = float(line[125:137])           # Ratio of star mass against mass of the Sun

        # Store the data into the Numpy array 'data'
        data = np.append(data, np.array([[time, delta_t, mass, radius, lum_core, lum_tot, flux, ratio]]), axis=0)

The problem here is that this takes some time (several minutes) and is required each time I create a new graph based upon this data.

I did consider reading the data into a list as and when each file was selected, but this causes a delay between selecting a file and being able to select the next file, as the data is read into the list.

I thought of reading in on a separate thread, but then how do I ensure that the data added to the list is in the correct order?

I would appreciate any and all suggestions on how I might approach this problem.
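
On the ordering question: one possible approach, sketched here under assumptions, is to hand each file to a thread pool at the moment it is selected and keep the returned futures in a list. The list preserves the selection order, so the results can be collected in that order no matter which file finishes first. The names `parse_file` and `BackgroundLoader` are hypothetical, and the parser only keeps the matching lines as a stand-in for the real field conversion. Note that in CPython, threads mainly keep the interface responsive; the parsing itself may not run any faster.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_file(path, keyword="hydro"):
    """Hypothetical parser: return the lines of one file that contain the keyword."""
    wanted = keyword.casefold()
    with open(path, encoding="utf8") as handle:
        return [line.rstrip("\n") for line in handle if wanted in line.casefold()]


class BackgroundLoader:
    """Start parsing a file when it is selected; collect results in selection order."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []  # one future per selected file, in selection order

    def select(self, path):
        # submit() returns at once, so the next file can be selected right away
        self._futures.append(self._pool.submit(parse_file, Path(path)))

    def all_rows(self):
        # result() waits for that particular file; walking the list in order
        # keeps the rows in selection order regardless of completion order
        rows = []
        for future in self._futures:
            rows.extend(future.result())
        return rows

    def close(self):
        # Wait for outstanding work and release the worker threads
        self._pool.shutdown()
```

In this sketch, `select()` would be called from the file-selection handler and `all_rows()` only when a graph is actually requested.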

RE: How to read in multiple files efficiently - DeaD_EyE - Jan-26-2023

Maybe the fileinput module is what you need.

https://docs.python.org/3/library/fileinput.html

import fileinput
from pathlib import Path

# example text files
files = list(Path().glob("*.txt"))

with fileinput.input(files, encoding="utf8") as file_input:
    for line in file_input:
        current_filename = file_input.filename()
        current_file_line = file_input.filelineno()
        global_lineno = file_input.lineno()
        
        # skipping line 1 and 2
        if current_file_line < 3:
            continue
        
        # next file e.g. at line 10
        if current_file_line == 10:
            file_input.nextfile()
            # maybe you want to continue here,
            # if line 10 should not be processed

        print(f"{current_filename} - {current_file_line} - {global_lineno}")
        # print(line, end="")

RE: How to read in multiple files efficiently - garynewport - Jan-27-2023

Thank you. To be honest, the routine for reading the files in works; it just works slowly.

I want to shift the reading in to a single event, where all data is read into a single list. The problem is, if I do that as the files are selected then the delay between reading one file in and the opportunity to select the next one is too great.
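
A minimal sketch of the "read once" idea, assuming the files do not change while the program runs: wrapping the loader in `functools.lru_cache` means each file is read from disk only the first time it is requested, and later graphs reuse the stored result. The function name `load_hydro_lines` is hypothetical, and it only keeps the matching lines rather than doing the real field conversion.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_hydro_lines(path, keyword="hydro"):
    """Hypothetical loader: read one file and keep only the keyword lines."""
    wanted = keyword.casefold()
    with open(path, encoding="utf8") as handle:
        # A tuple is returned so the cached result cannot be modified by accident
        return tuple(line.rstrip("\n") for line in handle if wanted in line.casefold())
```

The trade-off is that the cache does not notice changes on disk; `load_hydro_lines.cache_clear()` discards the stored results if the files have to be re-read.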

RE: How to read in multiple files efficiently - DeaD_EyE - Jan-27-2023

If your data has a fixed-width format, your approach is OK.
You can refactor the line parsing into a new function.
You could also create a data structure that defines the conversion function/method, start and end for each field.

# METHOD
def convert_fields1(self, line: str) -> dict:

    FIELDS = {
        "time": (self.convert, 21, 35),
        "delta_t": (self.convert, 36, 50),
        "mass": (self.convert, 51, 65),
        "radius": (self.convert, 66, 80),
        "lum_core": (self.convert, 81, 95),
        "lum_tot": (self.convert, 96, 110),
        "flux": (self.convert, 111, 125),
        "ratio": (float, 125, 137),
    }

    return {
        field: convert(line[start:end])
        for field, (convert, start, end) in FIELDS.items()
    }


# METHOD
def convert_fields2(self, line: str) -> list:
    """
    Method uses self.convert() and float()
    to convert the data-types for each field of the line.
    """

    # Could also be defined as a class variable
    # (method/function, start, end)
    FIELDS = (
        (self.convert, 21, 35),
        (self.convert, 36, 50),
        (self.convert, 51, 65),
        (self.convert, 66, 80),
        (self.convert, 81, 95),
        (self.convert, 96, 110),
        (self.convert, 111, 125),
        (float, 125, 137),
    )

    return [convert(line[start:end]) for convert, start, end in FIELDS]


### Placeholder for self.convert ###
# use your own implementation instead
class Converter:
    @staticmethod
    def convert(value):
        return float(value)


converter = Converter()
#####################################

# a line to check if the boundaries are right
line_test = "--------------------111111111111111/22222222222222/33333333333333/44444444444444/55555555555555/66666666666666/777777777777771.2345678912@@@@@@@@"

# returns a dict
# the first argument 'converter' is to fake your class
print(convert_fields1(converter, line_test))

# returns a list, field names are "unknown"
print(convert_fields2(converter, line_test))

# I guess all datatypes are floats, because an np.array can
# keep only data with the same datatype
import numpy as np

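# using the dict requires extracting the values first,
# and it is less efficient: creating a dict takes longer than
# creating a list, because all keys must be hashed.
# dict.values() returns a view object, so wrap it in list();
# otherwise np.array() stores the view itself as a single object
new_data1 = np.array(list(convert_fields1(converter, line_test).values()))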
print(new_data1)

# List version
# .values() is not required here, because the values
# are already in a list
new_data2 = np.array(convert_fields2(converter, line_test))
print(new_data2)
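
Building on the list version: `np.append` returns a new copy of the whole array on every call, so calling it once per line gets slower as the array grows. A common alternative, sketched here under assumptions, is to collect the converted rows in a plain Python list and build the array once at the end. The name `parse_lines` and the `FIELD_SLICES` table are hypothetical (the slices are copied from the field definitions above), and `float` stands in for `self.convert`.

```python
import numpy as np

# (start, end) slices copied from the field table above
FIELD_SLICES = (
    (21, 35), (36, 50), (51, 65), (66, 80),
    (81, 95), (96, 110), (111, 125), (125, 137),
)


def parse_lines(lines, keyword="hydro", convert=float):
    """Convert every line containing the keyword and return one 2-D array.

    'convert' is an assumption: pass your own conversion function
    if plain float() cannot handle the number format in the files.
    """
    wanted = keyword.casefold()
    rows = []
    for line in lines:
        if wanted in line.casefold():
            # Appending to a list does not copy the rows collected so far
            rows.append([convert(line[start:end]) for start, end in FIELD_SLICES])
    # Build the array once; reshape keeps 8 columns even when no line matched
    return np.array(rows, dtype=float).reshape(-1, len(FIELD_SLICES))
```

An open file object can be passed directly as `lines`, for example `data = parse_lines(file)` inside a `with open(...) as file:` block.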