Python Forum
[solved] how to speed-up huge data in an ascii file ?
#1
I haven't paid any attention to this topic so far, but now I'm wondering how I can quickly write a huge amount of data to an ASCII file (that's the current specification!).

The following snippet mimics what I'm trying to do:

import io
import os
import time

import numpy as np

print(f"Default buffer size={io.DEFAULT_BUFFER_SIZE}")

path = os.getcwd()
FileName = "myFile.txt"
stem = os.path.splitext(FileName)[0]  # file name without its extension

n = 1_000_000  # 10_000_000
M = np.random.random((n, 3))

# default buffering (open() already buffers writes when no size is given)
t1_0 = time.perf_counter()
with open(os.path.join(path, stem + '_0.txt'), 'w') as f:
    for i in range(n):
        f.write(f" X={M[i, 0]}, Y={M[i, 1]}, Z={M[i, 2]}\n")
t1_1 = time.perf_counter()
print(f"With default buffering, duration={t1_1-t1_0}")

# with an explicit buffer size (in bytes)
t2_0 = time.perf_counter()
TestValue = 2**10
with open(os.path.join(path, stem + '_1.txt'), 'w', buffering=TestValue) as f:
    for i in range(n):
        f.write(f" X={M[i, 0]}, Y={M[i, 1]}, Z={M[i, 2]}\n")
t2_1 = time.perf_counter()
print(f"With explicit buffering, duration={t2_1-t2_0}")
One can identify at least 2 main issues:
  1. the use of a loop
  2. the write function is called on each loop iteration, which is time-consuming

I'm currently trying to understand how to use buffering, but in practice the right value remains unclear for now; any general advice on how to write huge data to an ASCII file?
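
One way to act on the two issues listed above is to format the rows in chunks and hand each chunk to write() in a single call. The following is only a minimal sketch, not a benchmark: the function name write_rows_chunked, the chunk_size value and the output file name are hypothetical, and the numbers are written as plain Python floats after a tolist() conversion.

```python
import os
import tempfile

import numpy as np


def write_rows_chunked(path, M, chunk_size=100_000):
    """Write one ' X=..., Y=..., Z=...' line per row of M, one write() per chunk."""
    with open(path, "w") as f:
        for start in range(0, len(M), chunk_size):
            # tolist() converts the whole chunk to Python floats at once,
            # which avoids indexing the array element by element
            rows = M[start:start + chunk_size].tolist()
            f.write("".join(f" X={x}, Y={y}, Z={z}\n" for x, y, z in rows))


# small demonstration array (hypothetical size)
M = np.random.random((1_000, 3))
out_path = os.path.join(tempfile.gettempdir(), "myFile_chunked.txt")
write_rows_chunked(out_path, M, chunk_size=256)
```

With this layout the number of write() calls drops from one per row to one per chunk, so the buffering argument of open() matters much less.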

Thanks

P.
#2
Why not invoke numpy's functions to just store the array M in a file? I mean, the X=, Y=, Z= only slows things down, and it is not useful data.
#3
Hi Gribouillis

That's the specification, and the reality is a bit more complex (I guess you're speaking about np.savetxt, for instance).

I've already thought about working with "blocks" of data (arrays of strings or numbers; the number of rows and columns differs for each block), but in practice I'm still looking for a way.

Paul
#4
The problem here is that you are not only saving the data to a file, you are creating the data. It is the data creation that takes time, not the writing.
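
A quick way to check this claim is to time the two steps separately: first build the text lines in memory, then write the already-built lines to a file. The sketch below is only an illustration under assumptions (a smaller array, hypothetical helper names format_rows and write_lines, a hypothetical file name); the timings depend on the machine, so no figures are quoted here.

```python
import os
import tempfile
import time

import numpy as np


def format_rows(M):
    """Build the text lines in memory (the 'data creation' step)."""
    return [f" X={row[0]}, Y={row[1]}, Z={row[2]}\n" for row in M]


def write_lines(path, lines):
    """Write lines that are already formatted (the 'writing' step)."""
    with open(path, "w") as f:
        f.writelines(lines)


M = np.random.random((100_000, 3))

t0 = time.perf_counter()
lines = format_rows(M)
t1 = time.perf_counter()
write_lines(os.path.join(tempfile.gettempdir(), "timing_test.txt"), lines)
t2 = time.perf_counter()

print(f"formatting: {t1 - t0:.3f} s, writing: {t2 - t1:.3f} s")
```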

You could perhaps play with the fmt keyword argument in numpy.savetxt, something like 'X=%.18e, Y=%.18e, Z=%.18e'.
import numpy as np
import sys
n=10 # 1_000_000# 10_000_000
M=np.random.random((n, 3))

np.savetxt(sys.stdout, M, fmt='X=%.18e, Y=%.18e, Z=%.18e')
Output:
X=1.778314452437265158e-01, Y=7.362842666045655848e-01, Z=8.358127207042234108e-01
X=5.591744788035918345e-01, Y=7.845951465943425962e-01, Z=6.039963855998189413e-01
X=8.327560563335355548e-01, Y=6.042091153798287984e-01, Z=1.590375469584719426e-01
X=9.855324666099820607e-01, Y=6.029884572061958714e-01, Z=3.114472999689985588e-01
X=7.433919307334269089e-01, Y=2.941350276294346644e-01, Z=6.780499010590056441e-01
X=7.791133512845780373e-01, Y=2.911042379946882086e-01, Z=8.546676365400691644e-01
X=2.481689914304145983e-01, Y=4.970687878118742464e-01, Z=1.684596602818245747e-01
X=5.134374653560572765e-01, Y=1.239447760698755285e-01, Z=3.211991817077095579e-01
X=6.812413908422187969e-02, Y=9.637812239995832142e-01, Z=8.384101532932353162e-01
X=7.187521467518211971e-01, Y=9.752591487821623550e-01, Z=1.938176050010664841e-01
#5
Thanks Gribouillis,

Your advice helped me a lot; based on your example, I modified it a bit in order to concatenate different data with different sizes.

import numpy as np
import os

path = str(os.getcwd())

# Creating a numpy array
n = 10 # 1_000_000# 10_000_000
M1 = np.random.random((n, 3))

m = 100
M2 = np.random.random((m, 10))

# Opening a file
with open(path + '/file.txt', 'w') as f:

    # write M1, preceded by a header line (savetxt prefixes it with '# ')
    np.savetxt(f, M1, fmt='X=%.18e, Y=%.18e, Z=%.18e', header='blabla')
    # add an intermediate line
    f.write("####### a comment is added\n")
    # write M2 with the default format
    np.savetxt(f, M2)
Thanks
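
One detail worth knowing with this approach: np.savetxt writes the header string as a comment line, prefixed by default with '# ', and the comments keyword controls that prefix. The sketch below writes two small blocks to an in-memory buffer so the result is easy to inspect; the io.StringIO buffer, the tiny arrays, the shorter '%.3e' format and the 'second block' header are assumptions made for the illustration.

```python
import io

import numpy as np

M1 = np.arange(6, dtype=float).reshape(2, 3)  # small stand-in for the first block
M2 = np.arange(4, dtype=float).reshape(2, 2)  # second block with a different shape

buf = io.StringIO()
# default behaviour: the header line is written as '# blabla'
np.savetxt(buf, M1, fmt='X=%.3e, Y=%.3e, Z=%.3e', header='blabla')
buf.write("####### a comment is added\n")
# comments='' drops the default '# ' prefix in front of the header
np.savetxt(buf, M2, fmt='%.3e', header='second block', comments='')

text = buf.getvalue()
print(text)
```

Replacing the buffer with an open file object gives the same text on disk.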

Paul