Python Forum
Openpyxl with large file
#1
Good morning my dear Python Experts,

I'm new here. I'm not an expert with coding or Python, and I have an issue with openpyxl.

I have a process that produces xlsx files with data in sheets in a particular format.
I need to use Python to transform this data into another format for later use.

Here is part of my code:

import openpyxl as opxl

# OriginFile and DestinationFile are paths defined earlier in the script
WorkbookSource = opxl.load_workbook(OriginFile, read_only=True)
SheetSource = WorkbookSource['data']

row_count = SheetSource.max_row

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data_fancy")

for x in range(1, row_count + 1):  # openpyxl rows are 1-indexed, so go up to max_row inclusive
    DataXls = SheetSource.cell(row=x, column=1).value

    #
    # Things to make the world a happy place:
    # the omitted processing turns DataXls into Data_fancy (a list of cell values)
    #

    destination_sheet.append(Data_fancy)

print("debug point 1")
WorkbookSource.close()
WorkbookDestination.save(filename=DestinationFile)
print("debug point 2")
And now my problem:

The source file can contain anywhere from 1k to 70-80k lines. The code works fine when it processes 1k-5k lines (processing time to "debug point 1" around 5 seconds) but goes wild with larger files.
The processing time to "debug point 1" is around 1 minute for 50k lines (which is fine, no problem), but it then takes around 1 hour to reach "debug point 2", when it gets there at all.

Most of the time I get this error:
Error:
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Python\lib\site-packages\openpyxl\worksheet\_writer.py", line 33, in _openpyxl_shutdown
    os.remove(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\.....\\AppData\\Local\\Temp\\openpyxl.49ujk2e1'
I tried different things that I found on the internet, like:

- The optimised modes of openpyxl: they require two files for the processing instead of simply adding a new sheet to the origin file, and it's slower than without the optimised modes
- Using a generator to process the original data (a sketch of this approach follows below)
- Running the script on different machines; same problem
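
For reference, a minimal sketch of that generator attempt; the file names are placeholders and the real processing is reduced to a comment, so treat it as an illustration rather than my exact code. In read-only mode, iter_rows() streams rows lazily, which avoids the per-cell lookups of SheetSource.cell() (random access is slow on a read-only worksheet):

import openpyxl as opxl

def fancy_rows(path):
    # Yield one transformed row at a time from the 'data' sheet.
    wb = opxl.load_workbook(path, read_only=True)
    ws = wb['data']
    for row in ws.iter_rows(values_only=True):  # streams rows, low memory use
        #
        # Things to make the world a happy place: build Data_fancy from row
        #
        yield [row[0]]  # placeholder for Data_fancy
    wb.close()

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data_fancy")
for fancy in fancy_rows('origin.xlsx'):  # placeholder file name
    destination_sheet.append(fancy)
WorkbookDestination.save(filename='destination.xlsx')  # placeholder file name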

As the time to "append" the corrected data is not an issue for me (1 minute of processing for big files is good), I think the issue is in the "saving" process. I didn't find a relevant solution by myself, so that's why I'm here.

Do you have an idea of the direction in which I should search?

Have a nice day :)
#2
Good morning,

So, I tried to split the writing and saving of the workbook into chunks of 5k lines, and to introduce a "sleeping time" between each 5k lines of data written.
Even with 30 seconds of sleep at each saving point, an error occurs around 29k lines written.
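
Roughly what this looked like (all_fancy_rows is a placeholder for the processed data); note that, as far as I can tell, a write-only openpyxl workbook is meant to be saved exactly once, so repeated save() calls on the same workbook like this were asking for trouble:

import time
import openpyxl as opxl

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data_fancy")

for i, fancy in enumerate(all_fancy_rows, start=1):  # all_fancy_rows: the processed data
    destination_sheet.append(fancy)
    if i % 5000 == 0:
        WorkbookDestination.save(filename=DestinationFile)  # a write-only workbook
        time.sleep(30)                                      # only supports one save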
I will try another thing tomorrow.

Have a nice day :)
#3
Good morning,

So, I dumped all my data into a CSV file. It works fine, resulting in a 16 MB file for around 42k lines of initial data.
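
For completeness, the dump step looked roughly like this (all_fancy_rows is a placeholder for the processed data):

import csv

with open('/DontCrashPleaseData.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for fancy in all_fancy_rows:
        writer.writerow(fancy)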
After that, I wanted to append the CSV into a fresh xlsx file with the following code:

import csv
import openpyxl as opxl

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data")

with open('/DontCrashPleaseData.csv') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        destination_sheet.append(row)

WorkbookDestination.save(filename='/Yatta.xlsx')
And now I have a memory error:

Error:
Traceback (most recent call last):
  File "C:\Python\lib\xml\etree\ElementTree.py", line 836, in _get_writer
    yield file.write
  File "C:\Python\lib\xml\etree\ElementTree.py", line 777, in write
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  [Previous line repeated 1 more time]
  File "C:\Python\lib\xml\etree\ElementTree.py", line 916, in _serialize_xml
    write("<" + tag)
MemoryError
Long is the road...

Have a nice day :)
#4
Check this SO question: https://stackoverflow.com/questions/2187...ata-excels
and the suggested workarounds for saving the file incrementally.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#5
Hello,

(Oct-15-2019, 08:40 AM)buran Wrote: Check this SO question: https://stackoverflow.com/questions/2187...ata-excels
and the suggested workarounds for saving the file incrementally.

Thank you buran, but I had already tried incremental saving and it didn't work.
BUT I solved my problem: by using xlsxwriter.
It's pretty brute-force code, but it works on files with more than 100k lines in limited execution time, so I consider the job done.

Here is the relevant part of my code:

import openpyxl as opxl
from xlsxwriter.workbook import Workbook  # xlsxwriter's Workbook, used for the output file

WorkbookSource = opxl.load_workbook(OriginFile, read_only=True)
SheetSource = WorkbookSource['data']

row_count = SheetSource.max_row

# Put the data in memory using openpyxl
DataList = []
for x in range(1, row_count + 1):  # openpyxl rows are 1-indexed
    DataXls = SheetSource.cell(row=x, column=1).value
    #
    # Things to make the world a happy place:
    # the omitted processing turns DataXls into Data_fancy (a list of cell values)
    #
    DataList.append(Data_fancy)

WorkbookSource.close()

# Saving to xlsx using XlsxWriter
WorkbookDestination = Workbook(DestinationFile)
destination_sheet = WorkbookDestination.add_worksheet('data')
for row in range(1, row_count + 1):
    row_data = DataList[row - 1]
    for col, text in enumerate(row_data):
        destination_sheet.write(row - 1, col, text)  # xlsxwriter cells are 0-indexed

WorkbookDestination.close()
The row - 1 offset is needed because openpyxl and xlsxwriter count cell coordinates differently: openpyxl addresses A1 as (1, 1), while xlsxwriter addresses it as (0, 0).
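
A quick illustration of the difference, where ws_opxl stands for an openpyxl worksheet and ws_xlsx for an xlsxwriter worksheet (both names are placeholders):

ws_opxl.cell(row=1, column=1, value="hello")  # openpyxl: 1-indexed, writes to A1
ws_xlsx.write(0, 0, "hello")                  # xlsxwriter: 0-indexed, also writes to A1

Side note: xlsxwriter can also be opened with Workbook(DestinationFile, {'constant_memory': True}), which streams rows to disk as they are written (rows must then be added in row order); that should help if the files grow even bigger.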

Have a nice day :)