Batch processing and saving into new folder

aeritano · Jun-09-2018, 09:36 AM

Hi everyone.

i am super super new to coding in python (total of 6 hours training), and i am running into a problem with my code. More precisely, i have no clue how to proceed.

Here is the situation:
I have a single folder with about 100+ excel files. Each one contains a list of measurements. All the excel sheets are in identical format (in terms of columns), but vary in number of measurements (indexes). I have created a very rough script to perform the processing I want. But right now i have to manually input the excel sheet name and change the output file name. Doing this for 100+ is rather stupid. So I want to find a way to batch process all excel files in one folder, and save the output file into a different folder.

simplified: how can I make my code "universal" and process every file in the folder, and save the output into a new folder that is created within parent folder.

The link below provides a couple sample data sets
Datasets

Here is my poorly made script that is full of useless codes and not efficient, but it works

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jun  8 12:01:07 2018

@author: anthonyeritano
"""


import pandas as pd
import numpy as np
#choose excel file
File = 'test1.xlsx'
xl_file = pd.ExcelFile(File)

#Select spreadsheet data
Data= xl_file.parse('Sheet1') #"parse" reads a sheet from the Excel file

#Select values from Columns
Skel = Data['Skeleton ID']
Brnch = Data['Branch length']
Euc = Data['Euclidean distance']
tort= Data['Branch length']/Data['Euclidean distance']

# make DataFrame (new spreadsheet) for the values
DF_3 = pd.DataFrame({'Branch Number': Skel,
                     'Brnchln': Brnch,
                     'Eucl': Euc,
                     'Tort': tort})
DF_4 = DF_3[['Branch Number','Brnchln','Eucl','Tort']]
DF_4.head()

totalNumber = len(DF_4.dropna())
dic = DF_4.to_dict()
dic2 = {}
for key in dic.keys():
    for key2 in dic[key]:
        dic2[key + ' ' + str(key2)] = dic[key][key2]


key_names = ['Brnchln','Eucl','Tort']
numbers = range(totalNumber)
titlearray=[File]
ordered_names = []
ordered_values = []
for i in range(totalNumber):
    for name in key_names:
        temp_name = name + ' ' + str(i)
        ordered_names.append(temp_name)
        ordered_values.append(dic2[temp_name])
        #print(temp_name, dic2[temp_name])

array = np.array((ordered_names, ordered_values))
array.shape
#%%
clmntitle=array[0,:]
mesur=array[1,:]
newclmntitle=np.insert(clmntitle, 0, 'Filename')
newmesur=np.insert(mesur,0,titlearray)

array3=np.array((newclmntitle,newmesur))
array3
array3.shape
#%%
import xlsxwriter

workbook = xlsxwriter.Workbook('test1 processed.xlsx')
worksheet = workbook.add_worksheet()

row=0
col=0
worksheet.write_row(row, col, newclmntitle)

row=1
col=0
worksheet.write_row(row, col, newmesur)

workbook.close()

Any help is appreciated!!!

p.s. please go easy on me.. i know my coding sucks.. lol

ljmetzger · Jun-09-2018, 08:56 PM

Your code worked OK as is. I tested the code on Windows 10 with Python 3.6.5

NOTES:
a. Constants should be all CAPS (e.g. SKEL)
b. There should be a space after the '#' in comments
c. All 'import' statements should be at the top of the file
d. Create the 'SubFolder' if necessary
e. Create a function that reads and processes the Excel File (replaces original code)
f. Process all .xlsx files (CASE INSENSITIVE) in the Data Folder
g. if __name__ == '__main__': Every module in python has a special attribute called __name__ . The value of the __name__ attribute is set to '__main__' when a module is run as the main program. Otherwise the value of __name__ is set to contain the name of the module.

Key lines in the code:

    #Line 85 - this is the baseline folder
    current_dir = pathlib.Path(".")

    #Line 94 - this defines the 'data folder' (underneath the baseline folder)
    # Use the 'tempdata' Subfolder
    # NOTE: The '/' operator creates the proper path in Windows too
    data_dir = full_path /  "tempdata"

    #Line 101 - this defines the 'processed data folder' underneath the 'data folder'
    new_data_dir = data_dir / "processed_data_folder"

    #Line 102 - this creates the 'processed data folder' underneath the 'data folder'
    # If parents is true, any missing parents of this path are created as needed (otherwise Traceback error)
    # If exist_ok is true, no Traceback error if the Subfolder already exists
    pathlib.Path(new_data_dir).mkdir(parents=False, exist_ok=True) 

    #Line 112 - this defines the 'output folder and file name combination' from 'pieces parts'
    new_name = new_data_dir / (obj.stem + " processed" + obj.suffix)

Complete code follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Reference: https://python-forum.io/Thread-Batch-processing-and-saving-into-new-folder
# Reference: https://docs.python.org/3/library/pathlib.html

import pandas as pd
import numpy as np
import xlsxwriter
import pathlib 

# Modify the original code into a self contained function
def process_excel_file(infile, outfile):
    # print(counter, infile)
    # print(counter, outfile)
    
    xl_file = pd.ExcelFile(infile)
 
    # Select spreadsheet data
    Data= xl_file.parse('Sheet1') #"parse" reads a sheet from the Excel file
 
    #Select values from Columns
    Skel = Data['Skeleton ID']
    Brnch = Data['Branch length']
    Euc = Data['Euclidean distance']
    tort= Data['Branch length']/Data['Euclidean distance']
 
    # make DataFrame (new spreadsheet) for the values
    DF_3 = pd.DataFrame({'Branch Number': Skel,
                         'Brnchln': Brnch,
                         'Eucl': Euc,
                         'Tort': tort})
    DF_4 = DF_3[['Branch Number','Brnchln','Eucl','Tort']]
    DF_4.head()
 
    totalNumber = len(DF_4.dropna())
    dic = DF_4.to_dict()
    dic2 = {}
    for key in dic.keys():
        for key2 in dic[key]:
            dic2[key + ' ' + str(key2)] = dic[key][key2]
 
 
    key_names = ['Brnchln','Eucl','Tort']
    numbers = range(totalNumber)
    titlearray=[infile.name]
    ordered_names = []
    ordered_values = []
    for i in range(totalNumber):
        for name in key_names:
            temp_name = name + ' ' + str(i)
            ordered_names.append(temp_name)
            ordered_values.append(dic2[temp_name])
            #print(temp_name, dic2[temp_name])
 
    array = np.array((ordered_names, ordered_values))
    array.shape

    clmntitle=array[0,:]
    mesur=array[1,:]
    newclmntitle=np.insert(clmntitle, 0, 'Filename')
    newmesur=np.insert(mesur,0,titlearray)
 
    array3=np.array((newclmntitle,newmesur))
    array3
    array3.shape
 
    workbook = xlsxwriter.Workbook(outfile)
    worksheet = workbook.add_worksheet()
 
    row=0
    col=0
    worksheet.write_row(row, col, newclmntitle)
 
    row=1
    col=0
    worksheet.write_row(row, col, newmesur)
 
    workbook.close()
    return
    
if __name__ == '__main__':    

    # Get the Current Directory
    current_dir = pathlib.Path(".")
    print("Current Path: {}".format(current_dir))

    # Convert the current path to a full path name 
    full_path = current_dir.absolute()
    print("Full Path:    {}".format(full_path))

    # Use the 'tempdata' Subfolder
    # NOTE: The '/' operator creates the proper path in Windows too
    data_dir = full_path /  "tempdata"
    print("Data Path:    {}".format(data_dir))
    # print(type(data_dir)) 

    # Create the 'New Data Directory' as a subfolder of the 'tempdata' Subfolder
    # If parents is true, any missing parents of this path are created as needed (otherwise Traceback error)
    # If exist_ok is true, no Traceback error if the Subfolder already exists
    new_data_dir = data_dir / "processed_data_folder"
    pathlib.Path(new_data_dir).mkdir(parents=False, exist_ok=True) 
    print("Output Path:  {}".format(new_data_dir))
    print("")

    # iterate over all objects in the data folder (process .xlsx files only - CASE INSENSITIVE file names)
    counter = 0 
    for obj in data_dir.iterdir():
       if obj.is_file() and obj.suffix.lower() == ".xlsx":
          counter += 1
          print("Processing {:04} - {}".format(counter, obj))
          new_name = new_data_dir / (obj.stem + " processed" + obj.suffix)
          process_excel_file(obj, new_name)

Typical output:

Output:Current Path: .
Full Path:    C:\Users\Owner\Documents\Lew\WindowsBatchFiles\Python Lewis
Data Path:    C:\Users\Owner\Documents\Lew\WindowsBatchFiles\Python Lewis\tempdata
Output Path:  C:\Users\Owner\Documents\Lew\WindowsBatchFiles\Python Lewis\tempdata\processed_data_folder

Processing 0001 - C:\Users\Owner\Documents\Lew\WindowsBatchFiles\Python Lewis\tempdata\test1.xlsx
Processing 0002 - C:\Users\Owner\Documents\Lew\WindowsBatchFiles\Python Lewis\tempdata\test2.xlsX
Processing 0003 - C:\Users\Owner\Documents\Lew\WindowsBatchFiles\Python Lewis\tempdata\test3.xlsx

Lewis

volcano63 · Jun-09-2018, 09:51 PM

(Jun-09-2018, 08:56 PM)ljmetzger Wrote:

    for key in dic.keys():
        for key2 in dic[key]:
            dic2[key + ' ' + str(key2)] = dic[key][key2]

I gave the code a cursory look, and this part has drawn my attention. Iterating over dict.keys() maybe was acceptable in 2.2 or 2.3 - but for years it has been considered un-Pythonic, since iteration over dictionary already yields keys

Moreover, you can iterate over items

for key, dic2 in dic.items:
    for key2, value2 in dic2.items()
        dic2[key + ' ' + str(key2)] = value

Actually, a lot of things in your code look off - converting to dict, and then working hard to rearrange - but I am not up to the task now

aeritano · Jun-10-2018, 01:17 AM

(Jun-09-2018, 08:56 PM)ljmetzger Wrote: Your code worked OK as is. I tested the code on Windows 10 with Python 3.6.5

NOTES:
a. Constants should be all CAPS (e.g. SKEL)
b. There should be a space after the '#' in comments
c. All 'import' statements should be at the top of the file
d. Create the 'SubFolder' if necessary
e. Create a function that reads and processes the Excel File (replaces original code)
f. Process all .xlsx files (CASE INSENSITIVE) in the Data Folder
g. if __name__ == '__main__': Every module in python has a special attribute called __name__ . The value of the __name__ attribute is set to '__main__' when a module is run as the main program. Otherwise the value of __name__ is set to contain the name of the module.

Thank you! Lewis! The code works well!!! I will keep your notes in mind as I continue to learn how to code in Python!

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Miss-Classification problem with shuffled batch	Faebs94	0	1,562	Sep-02-2021, 11:55 AM Last Post: Faebs94
	Help batch converting .json chosen file to MySQL	BrandonKastning	2	2,347	Mar-14-2020, 09:19 PM Last Post: BrandonKastning
	Matplolib graphic from batch	achilledue	2	2,533	Sep-11-2019, 08:43 AM Last Post: buran
	Assign image names in folder to images in other folder.	ujjwalrathod007	15	7,811	Sep-03-2018, 05:27 PM Last Post: snippsat

Batch processing and saving into new folder

User Panel Messages

Announcements