Hi, I'm trying to find a simple way to run through a large data set (2432094 lines). I'm currently writing a program that takes the first 480 lines (8 hours) and creates a new array. Then I shift by 60 lines (one hour), grab another 480 lines, and create another array. I'd like to continue this for the whole data set.
Here's what I'm currently running:
n_window = n[:480,:]
n_window1 = n[540:1020,:]
n_window2 = n[1080:1560,:]
n_window3 = n[1620:2100,:]
n_window4 = n[2160:2640,:]
n_window5 = n[2700:3180,:]
n_window6 = n[3240:3720,:]
n_window7 = n[3780:4260,:]
n_window8 = n[4320:4800,:]
n_window9 = n[4860:5340,:]
n_window11 = n[5400:5880,:]
n_window12 = n[5940:6420,:]
n_window13 = n[6480:6960,:]
n_window14 = n[7020:7500,:]
n_window15 = n[7560:8040,:]
n_window16 = n[8100:8580,:]
n_window17 = n[8640:9120,:]
new_window = np.concatenate([n_window,n_window1, n_window2,n_window3,n_window4,n_window5,n_window6,n_window7,n_window8,n_window9,n_window11,n_window12,n_window13,n_window14,n_window15,n_window16,n_window17])
Can anyone help?
from itertools import count
import numpy as np
def get_subarrays(x):  # provide additional parameters, e.g. start, step etc.
    for s in (x[j:i, :] for j, i in zip(count(0, 540), count(480, 540))):
        if s.size:
            yield s
        else:
            return  # raising StopIteration inside a generator is an error in Python 3.7+
x = np.random.rand(10000, 10)
data = np.concatenate(list(get_subarrays(x)))
print(data.shape)
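The comment on the def line hints at adding extra parameters such as start and step. Purely as a sketch of that idea (my own variant, not part of the original answer; get_subarrays_param, window and step are names I made up), the window length and shift could be exposed like this:

def get_subarrays_param(x, start=0, window=480, step=540):
    # yield x[j:j + window, :] for j = start, start + step, ... until x runs out of rows
    for j in count(start, step):
        s = x[j:j + window, :]
        if s.size:
            yield s
        else:
            return

data = np.concatenate(list(get_subarrays_param(x, window=480, step=540)))

With window=480 and step=540 this reproduces the slices above; step is simply the offset between the starts of consecutive windows.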
To get the most efficient solution, play around with numpy.lib.stride_tricks.as_strided.
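For reference, here is one way as_strided could be used for this kind of fixed-size windowing. This is only a sketch, assuming x is C-contiguous, and unlike the generator above it keeps only the complete 480-row windows (any short tail is dropped); as_strided does no bounds checking, so shape and strides have to be computed carefully:

import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.random.rand(10000, 10)
window, step = 480, 540
n_win = (x.shape[0] - window) // step + 1   # number of complete windows

# view of shape (n_win, window, n_cols); window i starts at row i * step
wins = as_strided(
    x,
    shape=(n_win, window, x.shape[1]),
    strides=(step * x.strides[0], x.strides[0], x.strides[1]),
)
data = wins.reshape(-1, x.shape[1])   # stack the windows, like np.concatenate above
print(data.shape)                     # (8640, 10): 18 full windows of 480 rows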
Thank you Scidam for your quick response, and I apologize for my slow reply. Can you give me a breakdown, as I'm not familiar with this?
Thank you in advance,
Mark
I hope I understood you correctly. So, here are some comments on the code above:
from itertools import count
# count returns an infinite iterator, e.g. count(10, 5) creates an iterator starting at 10 with step 5: 10, 15, 20, ...; this sequence never ends.
# So, if you execute
#     for j in count(10, 5):  # this is an infinite loop
#         print(j)
# that execution will never stop until interrupted, e.g. with Ctrl+C.
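# (Side note, my addition: a safe way to preview count without an infinite loop is itertools.islice,
# which cuts the iterator off after a fixed number of items:
#     from itertools import islice
#     print(list(islice(count(10, 5), 4)))  # -> [10, 15, 20, 25]
# )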
import numpy as np
# (x[j:i, :] for j, i in zip(count(0, 540), count(480, 540))) is a generator expression:
# it produces x[0:480, :], x[540:1020, :], x[1080:1560, :] etc.
# When the start index reaches or exceeds the number of rows in x, this generator yields empty numpy arrays,
# and it would never stop on its own. To stop it we use a for-loop and check the size of the returned
# sub-array (s) on each iteration. If the returned sub-array is empty, we return from the generator,
# which ends the iteration (raising StopIteration manually inside a generator is an error in Python 3.7+).
# get_subarrays is a generator, that extracts subarrays from source array x
def get_subarrays(x):
    for s in (x[j:i, :] for j, i in zip(count(0, 540), count(480, 540))):
        if s.size:
            yield s
        else:
            return
x = np.random.rand(10000, 10) # Test array
# We need to pass a list of arrays to be concatenated, so let's create such a list.
# list(get_subarrays(x)) is equivalent to:
#     res = []
#     for item in get_subarrays(x):  # this loop ends when the generator is exhausted
#         res.append(item)           # (the usual behaviour of Python generators/iterators in for-loops)
# Now we can pass `res` to np.concatenate or, for short, use list(get_subarrays(x)) instead of `res`.
data = np.concatenate(list(get_subarrays(x)))
print(data.shape)
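One detail worth noting (my addition, not part of the answer above): because 10000 rows is not a multiple of the 540-row step, the last yielded sub-array is shorter than 480 rows, so print(data.shape) here should give (8920, 10) rather than a multiple of 480. If only complete 8-hour windows are wanted, the short tail can be dropped with a list comprehension, e.g.:

data_full = np.concatenate([s for s in get_subarrays(x) if s.shape[0] == 480])
print(data_full.shape)   # (8640, 10) for this 10000-row test array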
Thank you, that helps a lot. You're definitely understanding what I'm trying to do, and I'm now understanding more as well. Question about the variable data: is the first sub-array x[0:480, :]? I guess what I should ask is whether the step generator skips all the information between 0 and 540?
for j, i, k in zip(count(0, 540), count(480, 540), range(10)):
    print(k, j, ':', i)
Output:
0 0 : 480
1 540 : 1020
2 1080 : 1560
3 1620 : 2100
4 2160 : 2640
5 2700 : 3180
6 3240 : 3720
7 3780 : 4260
8 4320 : 4800
9 4860 : 5340
Thank you Scidam for all your help. My code is functioning exactly as I want it to.
Mark