Python Forum
What is the best way to merge csv files and commit them to a database?
#1
I have several csv files, each with a different kind of data. Currently I make a dataframe for each one and then merge the dataframes. I also have to process the data (calculating total scores) before committing it to a database. It works, but I find it hard to add support for new kinds of csv files with different data on them.
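
Here is a simplified sketch of what I am doing now, in case it helps (the file names, column names, and the total-score calculation are just placeholders):

import pandas as pd
from sqlalchemy import create_engine

# placeholder paths; the real files have different structures
csv_files = ["scores_math.csv", "scores_english.csv"]

# one dataframe per file, then concatenate them
frames = [pd.read_csv(f) for f in csv_files]
combined = pd.concat(frames, ignore_index=True)

# the processing step: total score across all *_score columns
score_columns = [c for c in combined.columns if c.endswith("_score")]
combined["total_score"] = combined[score_columns].sum(axis=1)

# commit to the database (SQLite here only as an example)
engine = create_engine("sqlite:///scores.db")
combined.to_sql("scores", engine, if_exists="append", index=False)

Every new kind of csv file means touching the merging and scoring code, which is what I would like to avoid.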

I was wondering what other people do to make this process easy and scalable?
Reply
#2
Do the csv files have the same headers? Does the data overlap?

What database are you writing the output to? MySQL?

Below is a way of merging any number of csv files, but it will include the header rows. That could be fixed if we knew the structure of the csv files.

from pathlib import Path
import os
# Python 3.10+ accepts | as a union in type annotations
# older versions (e.g. 3.8) need the workaround below:
# from __future__ import annotations
# it converts all annotations to str; mypy still type-checks them
# and the IDE still gets the type hints
# could get the in_files list with glob if there are a lot of files
in_files = ['/home/pedro/myPython/csv/csv/time.csv', '/home/pedro/myPython/csv/csv/randomdups.csv']
out_file = '/home/pedro/myPython/csv/csv/merged_output.csv'

def merge_files(input_files: list[str | Path], output_file: str | Path) -> None:
    # the caller may supply the paths as str
    # or as Path objects, but Path objects are
    # required below, so everything is converted here
    input_files = [Path(file) for file in input_files]
    output_file = Path(output_file)
     
    # open the output file in binary mode; this avoids encoding
    # errors when the input files do not all share the same encoding
    # Path objects have an open() method that supports context managers
    with output_file.open("wb") as fd_out:
        # iterating over all input files
        for input_file in input_files:
            # same here, opening the input file in binary mode
            with input_file.open("rb") as fd_in:
                # https://realpython.com/lessons/assignment-expressions/
                # copy the file in 4 KiB chunks; chunk starts empty so the
                # endswith check below also works for an empty input file
                chunk = b""
                while chunk := fd_in.read(4 * 1024):
                    fd_out.write(chunk)
                 
                # if the last line of input_file has no
                # line separator at the end, add one
                # keep in mind that everything is in binary mode
                # os.linesep is OS dependent
                if not chunk.endswith(os.linesep.encode("ascii")):
                    fd_out.write(os.linesep.encode("ascii"))

if __name__ == "__main__":
    merge_files(in_files, out_file)
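
If the files all share a single identical header row, a text-mode variant could keep the header from the first file only. A minimal sketch, assuming one header line per file and a common encoding (it reuses the Path import from above):

def merge_files_skip_headers(
    input_files: list[str | Path],
    output_file: str | Path,
    encoding: str = "utf-8",
) -> None:
    # assumes every file starts with one identical header row
    input_files = [Path(file) for file in input_files]
    with Path(output_file).open("w", encoding=encoding) as fd_out:
        for index, input_file in enumerate(input_files):
            with input_file.open("r", encoding=encoding) as fd_in:
                header = fd_in.readline()
                # keep the header from the first file only
                if index == 0:
                    fd_out.write(header)
                for line in fd_in:
                    fd_out.write(line)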
Reply
#3
If you need to do processing on the data, probably not. That implies the CSV files cannot be treated generically. A script that just opens CSV files, combines them, and writes them to a database table can be written knowing very little about the contents of those files.
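
For example, a generic loader could take the table name from the file name and the column names from the header row. A rough sketch with SQLite (the database choice, table naming, and all-text columns are assumptions):

import csv
import sqlite3
from pathlib import Path

def load_csv_to_table(db_path: str, csv_path: str) -> None:
    path = Path(csv_path)
    table = path.stem  # table name taken from the file name
    with path.open(newline="") as f, sqlite3.connect(db_path) as conn:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # every column is stored as text; no knowledge of the data needed
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader
        )

Any processing like total scores would still need file-specific code on top of this.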
Reply

