Python Forum
Openpyxl with large file
#1
Good morning my dear Python Experts,

I'm new here. I'm not an expert with coding or Python, and I have an issue with openpyxl.

I have a process that produces xlsx files with data in sheets in a particular format.
I need to use Python to transform this data into another format for later use.

Here is part of my code:

import openpyxl as opxl

# OriginFile and DestinationFile are paths defined earlier in the script
WorkbookSource = opxl.load_workbook(OriginFile, read_only=True)
SheetSource = WorkbookSource['data']

row_count = SheetSource.max_row

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data_fancy")

for x in range(1, row_count + 1):  # openpyxl rows are 1-indexed, so go up to max_row inclusive
    DataXls = SheetSource.cell(row=x, column=1).value

    #
    # Things to make the world a happy place:
    # the omitted processing turns DataXls into Data_fancy (a list of cell values)
    #

    destination_sheet.append(Data_fancy)

print("debug point 1")
WorkbookSource.close()
WorkbookDestination.save(filename=DestinationFile)
print("debug point 2")
And now my problem:

The source file can contain anywhere from 1k to 70-80k lines. The code works fine when it processes 1k-5k lines (processing time to "debug point 1" around 5 seconds) but goes wild with larger files.
The processing time to "debug point 1" is around 1 minute for 50k lines (which is fine, no problem), but it then takes around 1 hour to reach "debug point 2", when it gets there at all.

Most of the time I get this error:
Error:
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Python\lib\site-packages\openpyxl\worksheet\_writer.py", line 33, in _openpyxl_shutdown
    os.remove(path)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\.....\\AppData\\Local\\Temp\\openpyxl.49ujk2e1'
I tried different things that I found on the internet, like:

- The optimised modes of openpyxl: they require two files for the processing instead of simply adding a new sheet to the origin file, and it's slower than without the optimised modes
- Using a generator to process the original data (a sketch of this approach follows below)
- Running the script on different machines; same problem
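
For reference, a minimal sketch of that generator attempt; the file names are placeholders and the real processing is reduced to a comment, so treat it as an illustration rather than my exact code. In read-only mode, iter_rows() streams rows lazily, which avoids the per-cell lookups of SheetSource.cell() (random access is slow on a read-only worksheet):

import openpyxl as opxl

def fancy_rows(path):
    # Yield one transformed row at a time from the 'data' sheet.
    wb = opxl.load_workbook(path, read_only=True)
    ws = wb['data']
    for row in ws.iter_rows(values_only=True):  # streams rows, low memory use
        #
        # Things to make the world a happy place: build Data_fancy from row
        #
        yield [row[0]]  # placeholder for Data_fancy
    wb.close()

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data_fancy")
for fancy in fancy_rows('origin.xlsx'):  # placeholder file name
    destination_sheet.append(fancy)
WorkbookDestination.save(filename='destination.xlsx')  # placeholder file name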

As the time to "append" the corrected data is not an issue for me (1 minute of processing for big files is good), I think the issue is in the "saving" process. I didn't find a relevant solution by myself, so that's why I'm here.

Do you have an idea of the direction in which I should search?

Have a nice day :)
#2
Good morning,

So, I tried to split the writing and saving of the workbook into chunks of 5k lines, and to introduce a "sleeping time" between each 5k lines of data written.
Even with 30 seconds of sleep at each saving point, an error occurs around 29k lines written.
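
Roughly what this looked like (all_fancy_rows is a placeholder for the processed data); note that, as far as I can tell, a write-only openpyxl workbook is meant to be saved exactly once, so repeated save() calls on the same workbook like this were asking for trouble:

import time
import openpyxl as opxl

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data_fancy")

for i, fancy in enumerate(all_fancy_rows, start=1):  # all_fancy_rows: the processed data
    destination_sheet.append(fancy)
    if i % 5000 == 0:
        WorkbookDestination.save(filename=DestinationFile)  # a write-only workbook
        time.sleep(30)                                      # only supports one save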
I will try another thing tomorrow.

Have a nice day :)
#3
Good morning,

So, I dumped all my data into a CSV file. It works fine, resulting in a 16 MB file for around 42k lines of initial data.
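
For completeness, the dump step looked roughly like this (all_fancy_rows is a placeholder for the processed data):

import csv

with open('/DontCrashPleaseData.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for fancy in all_fancy_rows:
        writer.writerow(fancy)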
After that, I wanted to append the CSV into a fresh xlsx file with the following code:

import csv
import openpyxl as opxl

WorkbookDestination = opxl.Workbook(write_only=True)
destination_sheet = WorkbookDestination.create_sheet(title="data")

with open('/DontCrashPleaseData.csv') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        destination_sheet.append(row)

WorkbookDestination.save(filename='/Yatta.xlsx')
And now I have a memory error:

Error:
Traceback (most recent call last):
  File "C:\Python\lib\xml\etree\ElementTree.py", line 836, in _get_writer
    yield file.write
  File "C:\Python\lib\xml\etree\ElementTree.py", line 777, in write
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  File "C:\Python\lib\xml\etree\ElementTree.py", line 942, in _serialize_xml
    short_empty_elements=short_empty_elements)
  [Previous line repeated 1 more time]
  File "C:\Python\lib\xml\etree\ElementTree.py", line 916, in _serialize_xml
    write("<" + tag)
MemoryError
Long is the road...

Have a nice day :)
#4
Check this SO question: https://stackoverflow.com/questions/2187...ata-excels
and the suggested workarounds for saving the file incrementally.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#5
Hello,

(Oct-15-2019, 08:40 AM)buran Wrote: Check this SO question: https://stackoverflow.com/questions/2187...ata-excels
and the suggested workarounds for saving the file incrementally.

Thank you buran, but I had already tried incremental saving and it didn't work.
BUT I solved my problem: by using xlsxwriter.
It's pretty brute-force code, but it works on files with more than 100k lines in limited execution time, so I consider the job done.

Here is the relevant part of my code:

import openpyxl as opxl
from xlsxwriter.workbook import Workbook  # xlsxwriter's Workbook, used for the output file

WorkbookSource = opxl.load_workbook(OriginFile, read_only=True)
SheetSource = WorkbookSource['data']

row_count = SheetSource.max_row

# Put the data in memory using openpyxl
DataList = []
for x in range(1, row_count + 1):  # openpyxl rows are 1-indexed
    DataXls = SheetSource.cell(row=x, column=1).value
    #
    # Things to make the world a happy place:
    # the omitted processing turns DataXls into Data_fancy (a list of cell values)
    #
    DataList.append(Data_fancy)

WorkbookSource.close()

# Saving to xlsx using XlsxWriter
WorkbookDestination = Workbook(DestinationFile)
destination_sheet = WorkbookDestination.add_worksheet('data')
for row in range(1, row_count + 1):
    row_data = DataList[row - 1]
    for col, text in enumerate(row_data):
        destination_sheet.write(row - 1, col, text)  # xlsxwriter cells are 0-indexed

WorkbookDestination.close()
The row - 1 offset is needed because openpyxl and xlsxwriter count cell coordinates differently: openpyxl addresses A1 as (1, 1), while xlsxwriter addresses it as (0, 0).
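
A quick illustration of the difference, where ws_opxl stands for an openpyxl worksheet and ws_xlsx for an xlsxwriter worksheet (both names are placeholders):

ws_opxl.cell(row=1, column=1, value="hello")  # openpyxl: 1-indexed, writes to A1
ws_xlsx.write(0, 0, "hello")                  # xlsxwriter: 0-indexed, also writes to A1

Side note: xlsxwriter can also be opened with Workbook(DestinationFile, {'constant_memory': True}), which streams rows to disk as they are written (rows must then be added in row order); that should help if the files grow even bigger.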

Have a nice day :)