Openpyxl with large file - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Openpyxl with large file (/thread-21718.html) |
Openpyxl with large file - LocalFolder - Oct-11-2019 Good morning my dear Python Experts, I'm new here, I'm not an expert with coding or python and i have an issue with openpyxl. I have some process that produce xls files with data in sheets within a particular format. I need to use python to transform these data into another format for future uses Here a part of my code : import openpyxl as opxl from openpyxl import Workbook WorkbookSource = opxl.load_workbook(OriginFile, read_only=True) SheetSource = WorkbookSource['data'] row_count = SheetSource.max_row WorkbookDestination = opxl.Workbook(write_only = True) destination_sheet=WorkbookDestination.create_sheet(title="data_fancy") for x in range(1,row_count): DataXls = SheetSource.cell(row=x, column=1).value # #Things to make the world an happy place # destination_sheet.append(Data_fancy) print("debug point 1") WorkbookSource.close() WorkbookDestination.save(filename = DestisnationFile) print("debug point 2")And now my problem : The source file can contain between 1k to 70-80k lines, the code works fine when it process 1k-5k lines (process time to "debug point 1" around 5sec) but gone wild with larger files. The process time to "debug point 1" is around 1min for 50k line(it's fine, no problem) but takes around 1 hours to "debug point 2" when it goes. Most of the time i got the error : I tried different thing that i find on the internet like :- Optimised Modes for openpyxl : need to use 2 files for the processing instead of simply add a new sheet in the origin file and it's slower than without optimised file - Use of generator to process the original data - I try the scirpt on differents machines, same problem As the time to "append" the corrected data is not an issue for me(1 min of processing for big files is good), I think that the issue is with the "saving" process. I didn't find relevant solution by myself, so that why I'm here. Do you have an idea of the direction I should search ? Have a nice day :) RE: Openpyxl with large file - LocalFolder - Oct-14-2019 Good morning, So, I tried to split the writing and saving process of the workbook every 5k lines. I wanted to introduce "a sleeping time" between each 5k lines of data written. Even with 30sec of sleeping time at each saving point an error occur around 29k lines written. I will try another thing tomorrow. Have a nice day :) RE: Openpyxl with large file - LocalFolder - Oct-15-2019 Good morning, So, I dumped all my data into a csv file. It work fine, resulting a 16Mo file for around 42k lines of initial data. After that I wanted to append the csv into a fresh xls file with the following code : WorkbookDestination = opxl.Workbook(write_only = True) destination_sheet=WorkbookDestination.create_sheet(title="data") with open('/DontCrashPleaseData.csv') as f: reader = csv.reader(f, delimiter=',') for row in reader: destination_sheet.append(row) WorkbookDestination.save(filename = '/Yatta.xlsx')And now I have a memory error : Long is the road... Have a nice day :) RE: Openpyxl with large file - buran - Oct-15-2019 check this So question https://stackoverflow.com/questions/21875249/memory-error-using-openpyxl-and-large-data-excels and suggested workarounds for incremental saving the file RE: Openpyxl with large file - LocalFolder - Oct-15-2019 Hello, (Oct-15-2019, 08:40 AM)buran Wrote: check this So question https://stackoverflow.com/questions/21875249/memory-error-using-openpyxl-and-large-data-excels Thank you Buran but I already tried to do incremental saving and i didn't work. BUT I solved my problem: with using xlsxwriter. It's pretty brut code but it works with >100k lines files with limited execution time, so I consider the job's done. Here the relevant part of my code : import openpyxl as opxl from openpyxl import Workbook from xlsxwriter.workbook import Workbook WorkbookSource = opxl.load_workbook(OriginFile, read_only=True) SheetSource = WorkbookSource['data'] row_count = SheetSource.max_row #Put Data in memory using openpyxl DataList= [] for x in range(1,row_count): DataXls = SheetSource.cell(row=x, column=1).value # #Things to make the world an happy place # DataList.append(Data_fancy) WorkbookSource.close() #saving to xls using XlsxWriter WorkbookDestination = Workbook(DestinationFile) destination_sheet = WorkbookDestination.add_worksheet('data') for row in range(1,row_count-1): IrelevantVariable=DataList[row-1] for col, text in enumerate(IrelevantVariable): destination_sheet.write(row-1, col, text) WorkbookDestination.close()row-1 is needed because of the difference of how openpyxl and xlswriter calculate cell coordinates (from 1,1 for A1 to 0,0) Have a nice day :) |