Batch processing and saving into new folder - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Data Science (https://python-forum.io/forum-44.html) +--- Thread: Batch processing and saving into new folder (/thread-10841.html) |
Batch processing and saving into new folder - aeritano - Jun-09-2018 Hi everyone. i am super super new to coding in python (total of 6 hours training), and i am running into a problem with my code. More precisely, i have no clue how to proceed. Here is the situation: I have a single folder with about 100+ excel files. Each one contains a list of measurements. All the excel sheets are in identical format (in terms of columns), but vary in number of measurements (indexes). I have created a very rough script to perform the processing I want. But right now i have to manually input the excel sheet name and change the output file name. Doing this for 100+ is rather stupid. So I want to find a way to batch process all excel files in one folder, and save the output file into a different folder. simplified: how can I make my code "universal" and process every file in the folder, and save the output into a new folder that is created within parent folder. The link below provides a couple sample data sets Datasets Here is my poorly made script that is full of useless codes and not efficient, but it works #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Created on Fri Jun 8 12:01:07 2018 @author: anthonyeritano """ import pandas as pd import numpy as np #choose excel file File = 'test1.xlsx' xl_file = pd.ExcelFile(File) #Select spreadsheet data Data= xl_file.parse('Sheet1') #"parse" reads a sheet from the Excel file #Select values from Columns Skel = Data['Skeleton ID'] Brnch = Data['Branch length'] Euc = Data['Euclidean distance'] tort= Data['Branch length']/Data['Euclidean distance'] # make DataFrame (new spreadsheet) for the values DF_3 = pd.DataFrame({'Branch Number': Skel, 'Brnchln': Brnch, 'Eucl': Euc, 'Tort': tort}) DF_4 = DF_3[['Branch Number','Brnchln','Eucl','Tort']] DF_4.head() totalNumber = len(DF_4.dropna()) dic = DF_4.to_dict() dic2 = {} for key in dic.keys(): for key2 in dic[key]: dic2[key + ' ' + str(key2)] = dic[key][key2] key_names = ['Brnchln','Eucl','Tort'] numbers = range(totalNumber) titlearray=[File] ordered_names = [] ordered_values = [] for i in range(totalNumber): for name in key_names: temp_name = name + ' ' + str(i) ordered_names.append(temp_name) ordered_values.append(dic2[temp_name]) #print(temp_name, dic2[temp_name]) array = np.array((ordered_names, ordered_values)) array.shape #%% clmntitle=array[0,:] mesur=array[1,:] newclmntitle=np.insert(clmntitle, 0, 'Filename') newmesur=np.insert(mesur,0,titlearray) array3=np.array((newclmntitle,newmesur)) array3 array3.shape #%% import xlsxwriter workbook = xlsxwriter.Workbook('test1 processed.xlsx') worksheet = workbook.add_worksheet() row=0 col=0 worksheet.write_row(row, col, newclmntitle) row=1 col=0 worksheet.write_row(row, col, newmesur) workbook.close()Any help is appreciated!!! p.s. please go easy on me.. i know my coding sucks.. lol RE: Batch processing and saving into new folder - ljmetzger - Jun-09-2018 Your code worked OK as is. I tested the code on Windows 10 with Python 3.6.5 NOTES: a. Constants should be all CAPS (e.g. SKEL) b. There should be a space after the '#' in comments c. All 'import' statements should be at the top of the file d. Create the 'SubFolder' if necessary e. Create a function that reads and processes the Excel File (replaces original code) f. Process all .xlsx files (CASE INSENSITIVE) in the Data Folder g. if __name__ == '__main__': Every module in python has a special attribute called __name__ . The value of the __name__ attribute is set to '__main__' when a module is run as the main program. Otherwise the value of __name__ is set to contain the name of the module.Key lines in the code: #Line 85 - this is the baseline folder current_dir = pathlib.Path(".") #Line 94 - this defines the 'data folder' (underneath the baseline folder) # Use the 'tempdata' Subfolder # NOTE: The '/' operator creates the proper path in Windows too data_dir = full_path / "tempdata" #Line 101 - this defines the 'processed data folder' underneath the 'data folder' new_data_dir = data_dir / "processed_data_folder" #Line 102 - this creates the 'processed data folder' underneath the 'data folder' # If parents is true, any missing parents of this path are created as needed (otherwise Traceback error) # If exist_ok is true, no Traceback error if the Subfolder already exists pathlib.Path(new_data_dir).mkdir(parents=False, exist_ok=True) #Line 112 - this defines the 'output folder and file name combination' from 'pieces parts' new_name = new_data_dir / (obj.stem + " processed" + obj.suffix)Complete code follows: #!/usr/bin/env python3 # -*- coding: utf-8 -*- # Reference: https://python-forum.io/Thread-Batch-processing-and-saving-into-new-folder # Reference: https://docs.python.org/3/library/pathlib.html import pandas as pd import numpy as np import xlsxwriter import pathlib # Modify the original code into a self contained function def process_excel_file(infile, outfile): # print(counter, infile) # print(counter, outfile) xl_file = pd.ExcelFile(infile) # Select spreadsheet data Data= xl_file.parse('Sheet1') #"parse" reads a sheet from the Excel file #Select values from Columns Skel = Data['Skeleton ID'] Brnch = Data['Branch length'] Euc = Data['Euclidean distance'] tort= Data['Branch length']/Data['Euclidean distance'] # make DataFrame (new spreadsheet) for the values DF_3 = pd.DataFrame({'Branch Number': Skel, 'Brnchln': Brnch, 'Eucl': Euc, 'Tort': tort}) DF_4 = DF_3[['Branch Number','Brnchln','Eucl','Tort']] DF_4.head() totalNumber = len(DF_4.dropna()) dic = DF_4.to_dict() dic2 = {} for key in dic.keys(): for key2 in dic[key]: dic2[key + ' ' + str(key2)] = dic[key][key2] key_names = ['Brnchln','Eucl','Tort'] numbers = range(totalNumber) titlearray=[infile.name] ordered_names = [] ordered_values = [] for i in range(totalNumber): for name in key_names: temp_name = name + ' ' + str(i) ordered_names.append(temp_name) ordered_values.append(dic2[temp_name]) #print(temp_name, dic2[temp_name]) array = np.array((ordered_names, ordered_values)) array.shape clmntitle=array[0,:] mesur=array[1,:] newclmntitle=np.insert(clmntitle, 0, 'Filename') newmesur=np.insert(mesur,0,titlearray) array3=np.array((newclmntitle,newmesur)) array3 array3.shape workbook = xlsxwriter.Workbook(outfile) worksheet = workbook.add_worksheet() row=0 col=0 worksheet.write_row(row, col, newclmntitle) row=1 col=0 worksheet.write_row(row, col, newmesur) workbook.close() return if __name__ == '__main__': # Get the Current Directory current_dir = pathlib.Path(".") print("Current Path: {}".format(current_dir)) # Convert the current path to a full path name full_path = current_dir.absolute() print("Full Path: {}".format(full_path)) # Use the 'tempdata' Subfolder # NOTE: The '/' operator creates the proper path in Windows too data_dir = full_path / "tempdata" print("Data Path: {}".format(data_dir)) # print(type(data_dir)) # Create the 'New Data Directory' as a subfolder of the 'tempdata' Subfolder # If parents is true, any missing parents of this path are created as needed (otherwise Traceback error) # If exist_ok is true, no Traceback error if the Subfolder already exists new_data_dir = data_dir / "processed_data_folder" pathlib.Path(new_data_dir).mkdir(parents=False, exist_ok=True) print("Output Path: {}".format(new_data_dir)) print("") # iterate over all objects in the data folder (process .xlsx files only - CASE INSENSITIVE file names) counter = 0 for obj in data_dir.iterdir(): if obj.is_file() and obj.suffix.lower() == ".xlsx": counter += 1 print("Processing {:04} - {}".format(counter, obj)) new_name = new_data_dir / (obj.stem + " processed" + obj.suffix) process_excel_file(obj, new_name)Typical output: Lewis
RE: Batch processing and saving into new folder - volcano63 - Jun-09-2018 (Jun-09-2018, 08:56 PM)ljmetzger Wrote:for key in dic.keys(): for key2 in dic[key]: dic2[key + ' ' + str(key2)] = dic[key][key2] I gave the code a cursory look, and this part has drawn my attention. Iterating over dict.keys() maybe was acceptable in 2.2 or 2.3 - but for years it has been considered un-Pythonic, since iteration over dictionary already yields keysMoreover, you can iterate over items for key, dic2 in dic.items: for key2, value2 in dic2.items() dic2[key + ' ' + str(key2)] = valueActually, a lot of things in your code look off - converting to dict , and then working hard to rearrange - but I am not up to the task now
RE: Batch processing and saving into new folder - aeritano - Jun-10-2018 (Jun-09-2018, 08:56 PM)ljmetzger Wrote: Your code worked OK as is. I tested the code on Windows 10 with Python 3.6.5 Thank you! Lewis! The code works well!!! I will keep your notes in mind as I continue to learn how to code in Python! |