Python Forum
[solved] How to speed up writing huge data to an ASCII file?
#1
I've not paid any attention to this topic so far, but now I'm wondering how I can write a huge amount of data to an ASCII file quickly (that's the current specification!).

The following snippet mimics what I'm trying to do:

import os, time
import io
import numpy as np

BufferSize = io.DEFAULT_BUFFER_SIZE
print(f"Default buffer size={BufferSize}")

path = os.getcwd()
FileName = "myFile.txt"

n = 1_000_000  # 10_000_000
M = np.random.random((n, 3))

# with the default buffer size
t1_0 = time.time()
with open(path + '/' + FileName[:-4] + '_0.txt', 'w') as f:
    for i in range(n):
        f.write(f" X={M[i, 0]}, Y={M[i, 1]}, Z={M[i, 2]}\n")
t1_1 = time.time()
print(f"With loops, duration={t1_1 - t1_0}")

# with an explicit (smaller) buffer size
t2_0 = time.time()
TestValue = 2**10
with open(path + '/' + FileName[:-4] + '_1.txt', 'w', buffering=TestValue) as f:
    for i in range(n):
        f.write(f" X={M[i, 0]}, Y={M[i, 1]}, Z={M[i, 2]}\n")
t2_1 = time.time()
print(f"With buffering, duration={t2_1 - t2_0}")
One can identify at least two main issues:
  1. the use of a Python-level loop
  2. the write function is called on every iteration, which is time-consuming (one possible mitigation is sketched just below)
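One idea could be to decouple the formatting from the I/O: build all the lines in memory first, then hand them to a single writelines call. A rough, unbenchmarked sketch (the file name is just a placeholder):

import time
import numpy as np

n = 1_000_000
M = np.random.random((n, 3))

t0 = time.time()
# format every line first, then issue a single writelines call
lines = [f" X={x}, Y={y}, Z={z}\n" for x, y, z in M]
with open('myFile_batched.txt', 'w') as f:  # placeholder name
    f.writelines(lines)
t1 = time.time()
print(f"Batched write, duration={t1 - t0}")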

I'm currently trying to understand how to use buffering, but in practice the right buffer value remains unclear to me for now; any general advice on how to write huge data to an ASCII file?

Thanks

P.
#2
Why not invoke numpy's functions to just store the array M in a file? I mean, the X=, Y=, Z= labels only slow things down, and they are not useful data.
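Something like this, for instance (the file names are only examples):

import numpy as np

n = 1_000_000
M = np.random.random((n, 3))

# plain text, one row of three floats per line
np.savetxt('myFile_plain.txt', M)

# or binary, which is more compact and much faster to load back
np.save('myFile.npy', M)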
#3
Hi Gribouillis

That's the specification, and the reality is a bit more complex (I guess you're referring to np.savetxt, for instance).

I've already thought of working with "blocks" of data (arrays of strings or numbers - the number of rows and columns differs for each block), but in practice I'm still looking for a way to do it.

Paul
#4
The problem here is that you are not only saving the data to a file; you are also creating the data. It is the data creation that takes the time, not the writing.

You could perhaps play with the fmt keyword argument in numpy.savetxt, something like 'X=%.18e, Y=%.18e, Z=%.18e'.
import sys
import numpy as np

n = 10  # 1_000_000  # 10_000_000
M = np.random.random((n, 3))

np.savetxt(sys.stdout, M, fmt='X=%.18e, Y=%.18e, Z=%.18e')
Output:
X=1.778314452437265158e-01, Y=7.362842666045655848e-01, Z=8.358127207042234108e-01
X=5.591744788035918345e-01, Y=7.845951465943425962e-01, Z=6.039963855998189413e-01
X=8.327560563335355548e-01, Y=6.042091153798287984e-01, Z=1.590375469584719426e-01
X=9.855324666099820607e-01, Y=6.029884572061958714e-01, Z=3.114472999689985588e-01
X=7.433919307334269089e-01, Y=2.941350276294346644e-01, Z=6.780499010590056441e-01
X=7.791133512845780373e-01, Y=2.911042379946882086e-01, Z=8.546676365400691644e-01
X=2.481689914304145983e-01, Y=4.970687878118742464e-01, Z=1.684596602818245747e-01
X=5.134374653560572765e-01, Y=1.239447760698755285e-01, Z=3.211991817077095579e-01
X=6.812413908422187969e-02, Y=9.637812239995832142e-01, Z=8.384101532932353162e-01
X=7.187521467518211971e-01, Y=9.752591487821623550e-01, Z=1.938176050010664841e-01
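To check where the time actually goes, one could time the string formatting and the file write separately; a quick sketch (the file name is arbitrary):

import time
import numpy as np

n = 1_000_000
M = np.random.random((n, 3))

# time the string formatting alone
t0 = time.time()
lines = [f"X={M[i, 0]}, Y={M[i, 1]}, Z={M[i, 2]}\n" for i in range(n)]
t1 = time.time()

# time the actual file write alone
with open('timing_check.txt', 'w') as f:  # arbitrary name
    f.write(''.join(lines))
t2 = time.time()

print(f"formatting: {t1 - t0:.2f} s, writing: {t2 - t1:.2f} s")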
#5
Thanks Gribouillis,

Your advice helped me a lot; based on your example, I modified it a bit in order to concatenate different blocks of data with different sizes.

import os
import numpy as np

path = os.getcwd()

# Creating two numpy arrays with different shapes
n = 10  # 1_000_000  # 10_000_000
M1 = np.random.random((n, 3))

m = 100
M2 = np.random.random((m, 10))

# Opening a file and writing both blocks into it
with open(path + '/file.txt', 'w') as f:

    # appending M1 with labelled columns
    np.savetxt(f, M1, fmt='X=%.18e, Y=%.18e, Z=%.18e', header='blabla')
    # adding an intermediate comment line
    f.write("####### a comment is added\n")
    # appending M2 with the default format
    np.savetxt(f, M2)
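To read such a mixed file back, I suppose the labelled block has to be parsed by hand while np.loadtxt can handle the plain one; a sketch of what I have in mind, continuing from the snippet above (n and m are the block sizes used when writing):

import numpy as np

with open(path + '/file.txt') as f:
    lines = f.readlines()

# lines[0] is the '# blabla' header; the next n lines hold M1
M1_back = np.array([[float(tok.split('=')[1]) for tok in line.split(',')]
                    for line in lines[1:n + 1]])

# skip the header, the n labelled rows and the comment line, then parse M2
M2_back = np.loadtxt(path + '/file.txt', skiprows=n + 2)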
Thanks

Paul