Jul-31-2020, 06:59 PM
Here are the answers to your questions:
You want to subtract two vectors of size N. What is the order of N: 10^4, 10^5, or 10^9? What is the type of the data to be subtracted: integer or double? How many bytes per value?
N (the number of rows) reaches 10^9 per file, and the values are float64 numbers (so about 8 GB per array).
Let's consider what is likely the most efficient way to subtract these large arrays. We store the first array (say 8 bytes per element) in a binary file, and the second array in another file. We assume the files are so large that we cannot load either of them into memory.
That's right, we can't load a whole file at once. I tried np.loadtxt and it failed!
In principle, we can write a program, e.g. in C, that reads both files in chunks (since each element is 8 bytes, we can read, say, 8*10^6 bytes at a time), does the computation on each pair of chunks, and writes the result to another binary file. That would be a very efficient approach: no Python, no heavy extra libraries (pandas, NumPy, etc.), no overkill related to the HDF format! The bottleneck would be I/O — how fast is your HDD? Is it an SSD? Finally, you can convert the output binary file into an HDF file if needed...
So, is this approach suitable for you? What about your hardware?
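For illustration, the chunked approach described above can be sketched in Python rather than C (the thread later requires Python anyway). This is a minimal sketch, assuming both binary files hold the same number of raw float64 values; the function name and paths are placeholders, not anything from the original posts:

```python
import numpy as np

def subtract_binary_files(path_a, path_b, path_out, chunk_elems=1_000_000):
    """Read two binary files of float64 values in fixed-size chunks,
    subtract them element-wise, and append each result chunk to an
    output binary file. Assumes both inputs have the same length."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb, \
         open(path_out, "wb") as fo:
        while True:
            # np.fromfile returns fewer than chunk_elems values near EOF,
            # and an empty array once the file is exhausted.
            a = np.fromfile(fa, dtype=np.float64, count=chunk_elems)
            b = np.fromfile(fb, dtype=np.float64, count=chunk_elems)
            if a.size == 0:
                break
            (a - b).tofile(fo)  # vectorized subtraction, one chunk at a time
```

Only one chunk from each file is in memory at any moment, so peak memory is bounded by `chunk_elems * 8` bytes per array regardless of the total file size.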
The subtraction has to be performed in Python. I know chunking could help. In fact, I tried reading a chunk (N=10^6) from the first file and subtracting each of the chunk's elements from the corresponding chunk of the second file. But that still takes time, because I have to grab each element of file-1's chunk in a for loop.
See my progress below; I used memory mapping as well. It is efficient if I skip the subtraction and just iterate over the chunks. The "for j in range(m):" loop is the inefficient part, and that is with me only subtracting, not even saving the differences. Any better logic/implementation you can think of?
import numpy as np
import pandas as pd

size1 = 1_000_000
size2 = 1_000_000
filename = ["file-1.txt", "file-2.txt"]
count = a = i = prog_count = 0  # progress counters (were undefined in the original snippet)
chunks1 = pd.read_csv(filename[0], chunksize=size1, names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1, 4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2, 4))
for chunk1 in chunks1:  # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m, :] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2, names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2:  # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m):  # grabbing values from file-1's chunk -- this loop is the bottleneck
            e1 = fp1[j, 1]
            delta_mat = e1 - fp2  # just a test; actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m
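The slow inner loop above can be replaced by a single NumPy broadcast: instead of pulling one element of fp1 at a time in Python, subtract the whole columns at once. A minimal sketch, assuming (as the snippet's comment says) that what is wanted is every element of file-1's column 2 minus every element of file-2's column 2; the function name is mine, not from the thread:

```python
import numpy as np

def chunk_column_diffs(a_col, b_col):
    """All pairwise differences between two 1-D float64 arrays via
    broadcasting: result[j, i] = a_col[j] - b_col[i]. This one call
    replaces the per-element `for j in range(m)` Python loop."""
    return a_col[:, None] - b_col[None, :]
```

Inside the chunk loops this would be `delta_mat = chunk_column_diffs(fp1[0:m, 1], fp2[0:k, 1])`, moving the m*k subtractions from interpreted Python into compiled NumPy code.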