Python Forum
Merge CSV Column using Pandas Data Frames
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Merge CSV Column using Pandas Data Frames
#1
I'm new to python so there's plenty of things which I likely could improve upon with my code but the main focus is, "How to merge columns of data from a multiple csv files into a single data frame and then write that data frame to a new file". The contents of the data is server resource metric data. I export this data from the monitoring tool. The code I have written will join the files of which have the same number of columns:
import os, glob, shutil, pandas as pd
from datetime import datetime


app = 'app1'
metric = 'disk'


#File config settings
now = datetime.now()
date = now.strftime("%Y%m%d_%H%M%S")
source = ('C:\scripts\files\\' + app + '\\' + metric)
os.chdir(source)
csvFiles = glob.glob('*.csv')


#Sort the files prior to processing
csvFiles.sort()
sorted(csvFiles)


#Setup first data frame with the merge Date Time for all files
for files in csvFiles:
    df = pd.read_csv(files)
    df = df.drop(columns=['Date Time', ...])  #List of all the columns in the file except for Date Time(RAW) [not shown in the drop command to the right but its one of the many columns in the monitoring data and found in all metric types].  This is used to start the dataframe and the following for loop is to add the data to this one.
    

#merge all files on columns we only want to use matching on Date Time
for files in csvFiles:
    df_next = pd.read_csv(files)
    df_next = df_next.drop(columns=['Date Time', ...)  #same as above except I don't remove the columns I want to keep.
        

#Create a new header using the server name
df_cols = ['data_time', ...] #A list of all columns headers where I manually add the server name to the metric (ex. server1_diskIO).  So for 10 files, saving the 3 columns for each file would produce 30 headers +1 for the Date Time(RAW)
df.columns = df_cols
df.to_csv(source + '\combined\merged_' + app + '_' + metric + '_data-' + date + '.csv', index=False)


#Move the files to historical directory
for files in csvFiles:
    print(files + 'moved to ' + source + '\historical')
    shutil.move(files, source + '\historical')
-----------------------
Following one is not working
-----------------------
The above code works for files which I have with the same number of columns. Problem with CPU metric is that some servers have 4 CPUs and others have 8, 16, or 32. So each file would have different number of columns to drop and that blows up my code (which is lacking error handling - still learning). So I attempted attempted to filter on just Date Time, Date Time(RAW) to start [I couldn't do just one because it became a string value instead of dataframe] then attempted to merge just the filtered data in the second for loop on Date Time(RAW) from the following for loop. This created the columns and the headers but none of the data stored.

import os, glob, shutil, pandas as pd, numpy as np
from datetime import datetime


#Config settings
application = 'app1'
metric = 'cpu'


#File config settings
now = datetime.now()
date = now.strftime("%Y%m%d_%H%M%S")
source = ('C:\scripts\files\\' + app + '\\' + metric)
os.chdir(source)
csvFiles = glob.glob('*.csv')


#Sort the files prior to processing
csvFiles.sort()
sorted(csvFiles)


#Setup first data frame with the merge Date Time for all files
for files in csvFiles:
    df = pd.read_csv(files, names = ['Date Time', 'Date Time(RAW)'])
    

#df = df.drop(columns=['Date Time', ...])  #same method as above where I read in the file and drop the columns (lots of them) that I don't want to see
#merge all files on columns we only want to use matching on Date Time
for files in csvFiles:
    df_next = pd.read_csv(files, names = ['Date Time(RAW)', 'Total(RAW)', 'Percentile (Total)(RAW)']) #initial data frame had both Date Time and Date Time(RAW) but when I bring in the new files I left off Date Time as I didn't want to build additional columns for Data Time while merging on Date Time(RAW)
    df = pd.merge(df, df_next, on = 'Date Time(RAW)')
    print(df)
        

#Create a new header using the server name
df_cols = ['DateTime', 'data_time', ...']   #Same as above, manually built a header to reflect the server name and the resource metric
df.columns = df_cols
df.to_csv(source + '\combined\merged_' + app + '_' + metric + '_data-' + date + '.csv', index=False)


#Move the files to historical directory
for files in csvFiles:
    print(files + 'moved to ' + source + '\historical')
    shutil.move(files, source + '\historical')
Any help would be appreciated. The end results is to merge the data of hundreds of servers into 1 single file for a particular metric. i.e:
Date Time(RAW),metric1, metric2,etc.
date,metricData,metricData,etc
date,metricData,metricData,etc
date,metricData,metricData,etc
date,metricData,metricData,etc
date,metricData,metricData,etc

This way we can build a repository of the data and look at things over time for the metrics we care most about.

NOTE: Error handling, using this structure to loop through multiple applications and metric types would be huge benefits which I still need to work on, too, but would require the building of the column names and dropping of column names to be dynamic, too.
Reply
#2
I not sure whether I fully understood the question.

You may use concatenate data frames instead of the merge since you have a different number of columns.
See https://pandas.pydata.org/pandas-docs/st...rging.html

pd.concat([df,df_next], axis=0, ignore_index=True)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Pandas Outer merge skollu826 0 48 10 hours ago
Last Post: skollu826
  Python Alteryx QS-Passing pandas dataframe column inside SQL query where condition sanky1990 0 728 Dec-04-2023, 09:48 PM
Last Post: sanky1990
  Returning Column and Row Data From Spreadsheet knight2000 0 435 Oct-22-2023, 07:07 AM
Last Post: knight2000
  Better python library to create ER Diagram by using pandas data frames as tables klllmmm 0 1,103 Oct-19-2023, 01:01 PM
Last Post: klllmmm
  Using pyodbc&pandas to load a Table data to df tester_V 3 802 Sep-09-2023, 08:55 PM
Last Post: tester_V
  pandas : problem with conditional filling of a column Xigris 2 629 Jul-22-2023, 11:44 AM
Last Post: Xigris
  Is there a more elegant way to concatenate data frames? db042190 3 919 Jun-13-2023, 05:08 PM
Last Post: snippsat
  Database that can compress a column, or all data, automatically? Calab 3 1,164 May-22-2023, 03:25 AM
Last Post: Calab
  Code for pullng all data in a column EmBeck87 5 1,105 Apr-03-2023, 03:43 PM
Last Post: deanhystad
  How to assign a value to pandas dataframe column rows based on a condition klllmmm 0 828 Sep-08-2022, 06:32 AM
Last Post: klllmmm

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020