Python Forum
large csv to many xlsx containing multiple tabs - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: large csv to many xlsx containing multiple tabs (/thread-14225.html)



large csv to many xlsx containing multiple tabs - thatIsTheCase - Nov-20-2018

I need to automate the creation of around 1000 excel workbooks. The workbook will contain three tabs.

TAB 1 = Overview and instructions for fund holders. Will be the same for all xlsx.
TAB 2 = Grant Transactions
TAB 3 = Gift Transactions

I have two large csv files, one for each type of transaction (Gift/Grant).

Gist:

Open (as pandas.dataFrame) a static csv file containing transaction history for around 1000 fund accounts;
Save headers as variables to be used on all workbooks;
Use groupby on csv['fund_id'] and apply the unique values to a list.
Iterate through df using list values.
for each: initiate/open workbook object as ('Fund_nm'_'Fund_ID'.csv), initiate/open worksheet object with name 'Grants', write headers, write row values, close (save) worksheet.

I would like to write one excel workbook (.xlsx)for each fund account.
This means I will either have to write all three worksheets at once (intro tab, grant tab, gifts tab) or I will have to update xlsx when creating sheets two and three.

I have used xlsxWriter and pyexcel to write single workbooks. I am needing some help thinking through how to accomplish this task.

Objective: create ~1000 excel workbooks containing intro tab (constant across workbooks), grants tab, and gifts tab. I currently have the data in two csv files.

Any help thinking through the logical steps and/or recommending approaches to the task would be greatly appreciated.


RE: large csv to many xlsx containing multiple tabs - thatIsTheCase - Nov-23-2018

df1 = pd.read_csv('Grant_Transactions.csv')
fund_ids = list(df1['Account ID'].values)

for fund_id in fund_ids:
    df2  = df1[df1['Account ID'] == fund_id]

    writer = pd.ExcelWriter(str(fund) + '.xlsx', engine='xlsxwriter')
    df2.to_excel(writer, 'Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
I have managed to figure out how to sucessfully loop through to create multiple xlsx files. This code generates an excel file per unique ID.

ISSUES:
1. This code takes a long time to generate the 1700+ xlsx files. Is there are better way?
2. Given 1, I am concerned about adding the second tab (gift contribution transactions).

I need to figure out best way to do:
for fund_id in df2, where fund_id is equal to fund_id in df4(df3 is the pd.read_csv of 2nd csv), write df2.to_excel in sheet 1 and write df4 to sheet 2.


RE: large csv to many xlsx containing multiple tabs - thatIsTheCase - Nov-23-2018

df1 = pd.read_csv('Grant_Transactions.csv')
fund_ids = list(df1['Account ID'].values)
 
for fund_id in fund_ids:
    df2  = df1[df1['Account ID'] == fund_id]
 
    writer = pd.ExcelWriter(str(fund) + '.xlsx', engine='xlsxwriter')
    df2.to_excel(writer, 'Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']

    workbook.close()
Added workbook.close() to end but still is very slow to write 1700 xlsx.


RE: large csv to many xlsx containing multiple tabs - thatIsTheCase - Nov-27-2018

import pandas as pd

df0 = pd.read_csv('test_sample.csv', usecols=
                    ('typeCategory','tran_id',
                    'issue_d','disbursement_d','fund_id','fund_nm', 'amount','grantee_legal_nm'))
df00 = pd.read_csv('test_sample.csv', usecols=
                    ('typeCategory','tran_id',
                    'issue_d','fund_id','fund_nm', 'amount'))

df1 = df0[df0['typeCategory'] == 'grant']
df2 = df00[df00['typeCategory'] == 'gift']


grantingFundID = list(df0['fund_id'].values)
#cell_format = workbook.add_format()

#df1.columns

image_file = open('logo1.png', 'rb')
image_data = io.BytesIO(image_file.read())

for fund_id in grantingFundID:
    df3= df1[df1['fund_id'] == fund_id]
    df4= df2[df2['fund_id'] == fund_id]

    writer =  pd.ExcelWriter(str(fund_id)+'.xlsx', engine='xlsxwriter')
    df3.to_excel(writer, 'grants',index=False)
    df4.to_excel(writer, 'gifts',index=False)

    workbook = writer.book
    worksheet = writer.sheets['grants']
    worksheet.set_margins(top=1.3)
    worksheet.set_header('&L&G', {
                                  'image_left': 'logo.png',
                                  'image_data_left': image_data
                                  })
    worksheet.set_column(0,30,20)
    worksheet = writer.sheets['gifts']
    workbook.close()
Here is updated code. I am now adding header to xlsx (logo1.png). I have not added the overview tab. Posting here to save someone else the time in the future.