Python Forum
How to read in multiple files efficiently
#1
I have multiple files, each containing tens of thousands of lines of data. I cannot alter the way the data is presented; it appears as shown in the attached screenshot.

The first 2 lines can be ignored; I am only interested in the lines beginning with "hydro".

Currently, my program reads the data in from each file selected as and when I select a graph...

for line in file:
    if keyword.casefold() in line.casefold():
        time     = self.convert(line[21:35])      # Timestep
        delta_t  = self.convert(line[36:50])      # Change in timestep
        mass     = self.convert(line[51:65])      # Mass of star (excluding envelope)
        radius   = self.convert(line[66:80])      # Radius of star (excluding envelope)
        lum_core = self.convert(line[81:95])      # Luminosity of star (excluding envelope)
        lum_tot  = self.convert(line[96:110])     # Total luminosity (including enveloping cloud)
        flux     = self.convert(line[111:125])    # Mass flux
        ratio    = float(line[125:137])           # Ratio of star mass against mass of the Sun

        # Store the data into the Numpy array 'data'
        data = np.append(data, np.array([[time, delta_t, mass, radius, lum_core, lum_tot, flux, ratio]]), axis=0)
The problem here is that this takes some time (several minutes) and is required each time I create a new graph based upon this data.

I did consider reading the data into a list as and when each file was selected, but this causes a delay between selecting a file and being able to select the next file, as the data is read into the list.

I thought of reading in on a separate thread, but then how do I ensure that the data added to the list is in the correct order?

I would appreciate any and all suggestions on how I might approach this problem.

#2
Maybe the fileinput module is what you need.

https://docs.python.org/3/library/fileinput.html
import fileinput
from pathlib import Path

# example text files
files = list(Path().glob("*.txt"))

with fileinput.input(files, encoding="utf8") as file_input:
    for line in file_input:
        current_filename = file_input.filename()
        current_file_line = file_input.filelineno()
        global_lineno = file_input.lineno()
        
        # skipping line 1 and 2
        if current_file_line < 3:
            continue
        
        # next file e.g. at line 10
        if current_file_line == 10:
            file_input.nextfile()
            # maybe you want to continue here,
            # if line 10 should not be processed

        print(f"{current_filename} - {current_file_line} - {global_lineno}")
        # print(line, end="")
#3
Thank you. To be honest, the routine for reading the files in works; it just works slowly.

I want to shift the reading in to a single event, where all data is read into a single list. The problem is, if I do that as the files are selected then the delay between reading one file in and the opportunity to select the next one is too great.
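
One way this might be approached, as a rough sketch rather than a tested solution: submit each file to a thread pool the moment it is selected, and keep the returned futures in a list. The list preserves the selection order, so the results can be collected in that order no matter which file finishes parsing first. The class name BackgroundLoader and the fake_parse function are hypothetical; fake_parse only stands in for the real per-file reader. Note that pure-Python parsing is CPU-bound, so threads mainly keep the interface responsive and may not shorten the total parsing time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class BackgroundLoader:
    """Hypothetical helper: parses files in worker threads, keeps selection order."""

    def __init__(self, parse_file, max_workers=4):
        self._parse_file = parse_file  # your own per-file reader
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []  # one future per file, in selection order

    def add_file(self, path):
        """Call when a file is selected; returns immediately."""
        self._futures.append(self._executor.submit(self._parse_file, path))

    def results(self):
        """Wait for all files; results come back in the order files were added."""
        return [future.result() for future in self._futures]


def fake_parse(path):
    # Stand-in for the real parser: the first file deliberately takes longest
    time.sleep(0.05 if path == "a.txt" else 0.0)
    return path.upper()


loader = BackgroundLoader(fake_parse)
for name in ("a.txt", "b.txt", "c.txt"):
    loader.add_file(name)

# Order matches the selection order, not the completion order
print(loader.results())
```

In a GUI, add_file() would be called from the file-selection handler and results() only when the first graph is requested.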
#4
If your data has a fixed-width format, your approach is OK.
You can refactor the line parsing into a new function.
You could also create a data structure in which the function/method, start and end are defined for each field.



# METHOD
def convert_fields1(self, line: str) -> dict:

    FIELDS = {
        "time": (self.convert, 21, 35),
        "delta_t": (self.convert, 36, 50),
        "mass": (self.convert, 51, 65),
        "radius": (self.convert, 66, 80),
        "lum_core": (self.convert, 81, 95),
        "lum_tot": (self.convert, 96, 110),
        "flux": (self.convert, 111, 125),
        "ratio": (float, 125, 137),
    }

    return {
        field: convert(line[start:end])
        for field, (convert, start, end) in FIELDS.items()
    }


# METHOD
def convert_fields2(self, line: str) -> list:
    """
    Method uses self.convert() and float()
    to convert the data-types for each field of the line.
    """

    # Could also be defined as a class variable
    # (method/function, start, end)
    FIELDS = (
        (self.convert, 21, 35),
        (self.convert, 36, 50),
        (self.convert, 51, 65),
        (self.convert, 66, 80),
        (self.convert, 81, 95),
        (self.convert, 96, 110),
        (self.convert, 111, 125),
        (float, 125, 137),
    )

    return [convert(line[start:end]) for convert, start, end in FIELDS]


### Placeholder for self.convert ###
# instead you use your own implementation
class Converter:
    @staticmethod
    def convert(value):
        return float(value)


converter = Converter()
#####################################

# a line to check if the boundaries are right
line_test = "--------------------111111111111111/22222222222222/33333333333333/44444444444444/55555555555555/66666666666666/777777777777771.2345678912@@@@@@@@"

# returns a dict
# the first argument 'converter' is to fake your class
print(convert_fields1(converter, line_test))

# returns a list, field names are "unknown"
print(convert_fields2(converter, line_test))

# I guess all datatypes are floats, because an np.array can
# keep only data with the same datatype
import numpy as np

# using the dict requires getting the values first,
# but it's inefficient. Creating dicts takes longer than a list,
# because all keys must be hashed.
# list() is needed: np.array() on a dict view gives a 0-d object array
new_data1 = np.array(list(convert_fields1(converter, line_test).values()))
print(new_data1)

# List version
# here .values() is not required, because the values
# are already in a list
new_data2 = np.array(convert_fields2(converter, line_test))
print(new_data2)
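
A likely contributor to the slowness in the original loop is np.append, which returns a new array and therefore copies all existing rows on every call. A minimal sketch of an alternative: collect the rows in a plain list and build the array once at the end. Here float() stands in for self.convert(), whose implementation was not posted, and make_line() is a hypothetical helper that only fabricates a sample line matching the slice positions from the first post.

```python
import numpy as np

# (start, end) slices taken from the first post
FIELDS = ((21, 35), (36, 50), (51, 65), (66, 80),
          (81, 95), (96, 110), (111, 125), (125, 137))


def read_rows(lines, keyword="hydro"):
    """Collect matching rows in a list, then build the array once."""
    keyword = keyword.casefold()
    rows = []
    for line in lines:
        if keyword in line.casefold():
            # float() is a placeholder for self.convert()
            rows.append([float(line[start:end]) for start, end in FIELDS])
    if not rows:
        return np.empty((0, len(FIELDS)))
    return np.array(rows, dtype=float)


def make_line(values):
    """Hypothetical sample line: 7 fields of width 14, then a width-12 ratio."""
    fields = " ".join(f"{v:14.6e}" for v in values[:7])
    return "hydro".ljust(21) + fields + f"{values[7]:12.6f}"


sample = ["header one", "header two", make_line([1, 2, 3, 4, 5, 6, 7, 0.5])]
data = read_rows(sample)
print(data.shape)
```

Appending to a list is cheap, so the cost of building the array is paid once per file instead of once per line.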