Merge CSV Columns using Pandas Data Frames
I'm new to Python, so there is plenty I could likely improve in my code, but the main question is: how do I merge columns of data from multiple CSV files into a single data frame and then write that data frame to a new file? The data is server resource metric data, which I export from our monitoring tool. The code I have written will join files that have the same number of columns:
import os
import glob
import shutil
import pandas as pd
from datetime import datetime


app = 'app1'
metric = 'disk'


#File config settings
now = datetime.now()
date = now.strftime("%Y%m%d_%H%M%S")
source = 'C:\\scripts\\files\\' + app + '\\' + metric  #backslashes must be escaped or '\f' becomes a form feed
os.chdir(source)
csvFiles = glob.glob('*.csv')


#Sort the files prior to processing
csvFiles.sort()


#Set up the first data frame, keeping only the Date Time(RAW) key column.
#The drop list names every column in the file except Date Time(RAW); the full
#list is long, so it is elided here.
df = pd.read_csv(csvFiles[0])
df = df.drop(columns=['Date Time', ...])
    

#Merge each file's metric columns into the data frame, matching on Date Time(RAW)
for files in csvFiles:
    df_next = pd.read_csv(files)
    #Same drop list as above, except the metric columns I want to keep stay in
    df_next = df_next.drop(columns=['Date Time', ...])
    df = pd.merge(df, df_next, on='Date Time(RAW)')
        

#Create a new header using the server names.  The list holds every column header with
#the server name manually added to the metric (ex. server1_diskIO); for 10 files,
#keeping 3 columns per file gives 30 headers, plus 1 for Date Time(RAW).
df_cols = ['data_time', ...]
df.columns = df_cols
df.to_csv(source + '\\combined\\merged_' + app + '_' + metric + '_data-' + date + '.csv', index=False)


#Move the files to a historical directory
for files in csvFiles:
    print(files + ' moved to ' + source + '\\historical')
    shutil.move(files, source + '\\historical')
-----------------------
The following script is not working
-----------------------
The above code works for files that have the same number of columns. The problem with the CPU metric is that some servers have 4 CPUs while others have 8, 16, or 32, so each file has a different number of columns to drop, and that blows up my code (which is lacking error handling; I'm still learning). So I attempted to filter on just Date Time and Date Time(RAW) to start (I couldn't use just one because it became a string value instead of a data frame), and then merge just the filtered data on Date Time(RAW) in the second for loop. This created the columns and the headers, but none of the data was stored.

import os
import glob
import shutil
import pandas as pd
import numpy as np
from datetime import datetime


#Config settings
app = 'app1'
metric = 'cpu'


#File config settings
now = datetime.now()
date = now.strftime("%Y%m%d_%H%M%S")
source = 'C:\\scripts\\files\\' + app + '\\' + metric  #backslashes must be escaped or '\f' becomes a form feed
os.chdir(source)
csvFiles = glob.glob('*.csv')


#Sort the files prior to processing
csvFiles.sort()


#Setup first data frame with the merge Date Time for all files
for files in csvFiles:
    df = pd.read_csv(files, names = ['Date Time', 'Date Time(RAW)'])
    

#df = df.drop(columns=['Date Time', ...])  #same method as above: drop the (many) columns I don't want
#merge all files on columns we only want to use matching on Date Time
for files in csvFiles:
    #The initial data frame has both Date Time and Date Time(RAW); here I leave off
    #Date Time so I don't build extra Date Time columns while merging on Date Time(RAW)
    df_next = pd.read_csv(files, names = ['Date Time(RAW)', 'Total(RAW)', 'Percentile (Total)(RAW)'])
    df = pd.merge(df, df_next, on = 'Date Time(RAW)')
    print(df)
        

#Create a new header using the server name.  Same as above: headers built manually to
#reflect the server name and the resource metric.
df_cols = ['DateTime', 'data_time', ...]
df.columns = df_cols
df.to_csv(source + '\\combined\\merged_' + app + '_' + metric + '_data-' + date + '.csv', index=False)


#Move the files to a historical directory
for files in csvFiles:
    print(files + ' moved to ' + source + '\\historical')
    shutil.move(files, source + '\\historical')
Any help would be appreciated. The end result is to merge the data from hundreds of servers into one single file for a particular metric, i.e.:
Date Time(RAW),metric1,metric2,etc.
date,metricData,metricData,etc
date,metricData,metricData,etc
date,metricData,metricData,etc
date,metricData,metricData,etc
date,metricData,metricData,etc

This way we can build a repository of the data and look at things over time for the metrics we care most about.

NOTE: Error handling, and using this structure to loop through multiple applications and metric types, would be huge benefits that I still need to work on too, but they would require building the column names and the drop lists dynamically as well.
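
Something along these lines is what I imagine a dynamic version would look like, though I have not gotten it working; the keep list, the usecols filter, and deriving the server name from the file name are just my guesses at an approach:

import os
import glob
from functools import reduce
import pandas as pd

#Hypothetical keep list: the timestamp key plus the CPU metric columns I care
#about.  Each metric type would need its own list.
keep = ['Date Time(RAW)', 'Total(RAW)', 'Percentile (Total)(RAW)']

frames = []
for f in glob.glob('*.csv'):
    server = os.path.splitext(f)[0]  #assumes each file is named after its server
    #usecols keeps only the listed columns, so the number of CPU columns in the
    #file no longer matters and there is no drop list to maintain
    df = pd.read_csv(f, usecols=lambda c: c in keep)
    #Prefix the metric columns with the server name, ex. server1_Total(RAW)
    df = df.rename(columns={c: server + '_' + c for c in keep[1:]})
    frames.append(df)

#One wide frame keyed on the shared timestamp
merged = reduce(lambda left, right: pd.merge(left, right, on='Date Time(RAW)'), frames)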
I'm not sure whether I fully understood the question.

You may concatenate the data frames instead of merging them, since you have a different number of columns.
See https://pandas.pydata.org/pandas-docs/st...rging.html

pd.concat([df, df_next], axis=0, ignore_index=True)
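
For example (a toy sketch with made-up server columns, just to show the shape of the result):

import pandas as pd

df = pd.DataFrame({'Date Time(RAW)': [1, 2], 'server1_cpu': [10, 20]})
df_next = pd.DataFrame({'Date Time(RAW)': [1, 2], 'server2_cpu0': [30, 40], 'server2_cpu1': [50, 60]})

#Rows are stacked; columns that are missing from one frame are filled with NaN
combined = pd.concat([df, df_next], axis=0, ignore_index=True)
print(combined)

If you want one row per timestamp afterwards, you can collapse the stacked frame with combined.groupby('Date Time(RAW)').first(), which takes the first non-null value in each column per timestamp.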