Openpyxl with large file - Printable Version

Openpyxl with large file - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Openpyxl with large file (/thread-21718.html)

Openpyxl with large file - LocalFolder - Oct-11-2019

Good morning my dear Python Experts,

I'm new here, I'm not an expert with coding or python and i have an issue with openpyxl.

I have some process that produce xls files with data in sheets within a particular format.
I need to use python to transform these data into another format for future uses

Here a part of my code :

import openpyxl as opxl
from openpyxl import Workbook

WorkbookSource = opxl.load_workbook(OriginFile, read_only=True)
SheetSource = WorkbookSource['data']

row_count = SheetSource.max_row

WorkbookDestination = opxl.Workbook(write_only = True)
destination_sheet=WorkbookDestination.create_sheet(title="data_fancy")

for x in range(1,row_count):
            
	DataXls = SheetSource.cell(row=x, column=1).value
	
    #
	#Things to make the world an happy place
	#
	
	destination_sheet.append(Data_fancy)

print("debug point 1")
WorkbookSource.close()
WorkbookDestination.save(filename = DestisnationFile)
print("debug point 2")

And now my problem :

The source file can contain between 1k to 70-80k lines, the code works fine when it process 1k-5k lines (process time to "debug point 1" around 5sec) but gone wild with larger files.
The process time to "debug point 1" is around 1min for 50k line(it's fine, no problem) but takes around 1 hours to "debug point 2" when it goes.

Most of the time i got the error :

Error:Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Python\lib\site-packages\openpyxl\worksheet\_writer.py", line 33, in _openpyxl_shutdown
    os.remove(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\.....\\AppData\\Local\\Temp\\openpyxl.49ujk2e1'

I tried different thing that i find on the internet like :

- Optimised Modes for openpyxl : need to use 2 files for the processing instead of simply add a new sheet in the origin file and it's slower than without optimised file
- Use of generator to process the original data
- I try the scirpt on differents machines, same problem

As the time to "append" the corrected data is not an issue for me(1 min of processing for big files is good), I think that the issue is with the "saving" process. I didn't find relevant solution by myself, so that why I'm here.

Do you have an idea of the direction I should search ?

Have a nice day :)

RE: Openpyxl with large file - LocalFolder - Oct-14-2019

Good morning,

So, I tried to split the writing and saving process of the workbook every 5k lines. I wanted to introduce "a sleeping time" between each 5k lines of data written.
Even with 30sec of sleeping time at each saving point an error occur around 29k lines written.
I will try another thing tomorrow.

Have a nice day :)

RE: Openpyxl with large file - LocalFolder - Oct-15-2019

Good morning,

So, I dumped all my data into a csv file. It work fine, resulting a 16Mo file for around 42k lines of initial data.
After that I wanted to append the csv into a fresh xls file with the following code :

WorkbookDestination = opxl.Workbook(write_only = True)
destination_sheet=WorkbookDestination.create_sheet(title="data")
        
with open('/DontCrashPleaseData.csv') as f:
     reader = csv.reader(f, delimiter=',')
     for row in reader:
         destination_sheet.append(row)

WorkbookDestination.save(filename = '/Yatta.xlsx')

And now I have a memory error :

Error:Traceback (most recent call last):
  File "C:\Python\lib\xml\etree\ElementTree.py", line 836, in _get_writer
    yield file.write
  File "C:\Python\lib\xml\etree\ElementTree.py", line 777, in write
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  [Previous line repeated 1 more time]
  File "C:\Python\lib\xml\etree\ElementTree.py", line 916, in _serialize_xml
    write("<" + tag)
MemoryError

Long is the road...

Have a nice day :)

RE: Openpyxl with large file - buran - Oct-15-2019

check this So question https://stackoverflow.com/questions/21875249/memory-error-using-openpyxl-and-large-data-excels
and suggested workarounds for incremental saving the file

RE: Openpyxl with large file - LocalFolder - Oct-15-2019

Hello,

(Oct-15-2019, 08:40 AM)buran Wrote: check this So question https://stackoverflow.com/questions/21875249/memory-error-using-openpyxl-and-large-data-excels
and suggested workarounds for incremental saving the file

Thank you Buran but I already tried to do incremental saving and i didn't work.
BUT I solved my problem: with using xlsxwriter.
It's pretty brut code but it works with >100k lines files with limited execution time, so I consider the job's done.

Here the relevant part of my code :

import openpyxl as opxl
from openpyxl import Workbook
from xlsxwriter.workbook import Workbook
 
WorkbookSource = opxl.load_workbook(OriginFile, read_only=True)
SheetSource = WorkbookSource['data']
 
row_count = SheetSource.max_row

#Put Data in memory using openpyxl
DataList= []
for x in range(1,row_count):
    DataXls = SheetSource.cell(row=x, column=1).value
    #
    #Things to make the world an happy place
    #
    DataList.append(Data_fancy)

WorkbookSource.close()

#saving to xls using XlsxWriter
WorkbookDestination = Workbook(DestinationFile)
destination_sheet = WorkbookDestination.add_worksheet('data')
for row in range(1,row_count-1):
    IrelevantVariable=DataList[row-1]
    for col, text in enumerate(IrelevantVariable):
        destination_sheet.write(row-1, col, text)
                
WorkbookDestination.close()

row-1 is needed because of the difference of how openpyxl and xlswriter calculate cell coordinates (from 1,1 for A1 to 0,0)

Have a nice day :)