Finding a way to avoid loops use - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Data Science (https://python-forum.io/forum-44.html) +--- Thread: Finding a way to avoid loops use (/thread-39804.html) |
Finding a way to avoid loops use - paul18fr - Apr-15-2023 Hi Using loops is a bottleneck in term of performance; I’m often facing such issue, I’m not satisfied by the current code and I’m still looking for a way to speed-up it. Each element is checked one at a time; some numpy functions cannot be used: np.intersect1d for example Of course I’ll have a look to parallelization, but I’m wondering if somebody has ever found a better solution? Here and for each element of the 1rst column of the GP array, the goal is to add a column with the corresponding value of the 2nd column of the M array; note in GP, values are not unique, they are not sorted, and they do not necessarily follow each other. Note it's here a test case; in the "real life", I'm dealing with hundreds millions of rows Thanks Paul import numpy as np import time M = np.array([[ 1, 2], [ 2, 5], [ 3, 7], [ 4, 2], [ 50, 5], [ 6, 11], [ 70, 5], [ 8, 11], [ 100, 2]] ) GP = np.array([[ 1, 1, 100], [ 1, 2, 100], [ 1, 5, 100], [ 1, 3, 100], [ 1, 4, 100], [ 3, 1, 100], [ 3, 2, 100], [ 3, 3, 100], [ 3, 4, 100], [ 70, 3, 90], [ 70, 4, 90], [ 70, 2, 92], [ 70, 2, 93], [ 6, 1, 100], [ 6, 8, 100], [ 6, 5, 100], [ 6, 4, 100], [ 6, 3, 100], [ 6, 7, 100], [ 6, 2, 100], [ 6, 6, 100]] ) # GP size is increased for time purpoase n = 20 for j in range(n): GP = np.vstack((GP, GP)) r1, c1 = np.shape(M) r2, c2 = np.shape(GP) ####################################################### # solution 1 : start from each element in M even if it does not exist R1 = np.zeros(r2) t0 = time.time() for i in range(r1): Index = np.where(GP[:, 0].reshape(-1, 1) == M[i, 0])[0] R1[Index] = M[i, 1] t1 = time.time() print(f"Duration Solution1 = {(t1 - t0):.2e}") ####################################################### # solution 2 : only existing elements in GP are dealt R2 = np.zeros( (r2, 1)) t2 = time.time() t2_1 = time.time() UniqueSortedElement, ReturnIndex, ReturnInverse, ReturnOccurences = np.unique(GP[:, 0], return_index = True, return_inverse = True, return_counts = True) t2_2 = time.time() n = np.shape(UniqueSortedElement)[0] tempo = np.zeros(n) for i in range(n): Index = np.where(M[:, 0].reshape(-1, 1) == UniqueSortedElement[i])[0] tempo[i] = M[Index, 1] t2_3 = time.time() R2 = tempo[ReturnInverse] # we've been using the "return_inverse" capability to build the R2 matrix t3 = time.time() print(f"Duration Solution2 = {(t3 - t2):.2e}") print(f"\t-interm1 = = {(t2_2 - t2_1):.2e}") print(f"\t-interm2 = = {(t2_3 - t2_2):.2e}") print(f"\t-interm3 = = {(t3 - t2_3):.2e}") Check = np.array_equal(R1, R2) print(f"Are R1 and R2 identical : {Check}") # finally a new column is added to GP GP = np.hstack((GP, R1.reshape(-1, 1))) RE: Finding a way to avoid loops use - Gribouillis - Apr-15-2023 It is difficult to understand what your code does. I commented the lines 39 and 40 where you increase the size of GP in an attempt to see the result. Here is R1 after the computation [ 2. 2. 2. 2. 2. 7. 7. 7. 7. 5. 5. 5. 5. 11. 11. 11. 11. 11. 11. 11. 11.]Can you explain how this resulting array is connected to M and GP? I really don't get that. (Apr-15-2023, 09:51 AM)paul18fr Wrote: in the "real life", I'm dealing with hundreds millions of rowsI wonder what you do in real life. What do you do with this hundreds millions rows? RE: Finding a way to avoid loops use - paul18fr - Apr-15-2023 Hi Gribouillis As you did, if n is let to 1, then it becomes easier to see the results ; in other word, for each element of GP[:, 0], I want to affect the corresponding value in M[:, 1]. I hope the following picture helps a bit:[img[/img] Note:
Quote:I wonder what you do in real life. What do you do with this hundreds millions rows?I'm dealing with results coming from simulations (finite elements), and models become huger and huger, several time steps and so on ... speed is a key characteristics, loops are time consuming and I'm not satisfied by previous snippets I shared; I'm wondering if someone has a better way to do? Paul RE: Finding a way to avoid loops use - Gribouillis - Apr-15-2023 I have an alternate way of doing it using package numpy-indexed, but it is not significantly faster #### # solution 4 import numpy_indexed as npi t4_1 = time.time() R4 = npi.remap(GP[:, 0], M[:,0], M[:,1]) t4_2 = time.time()Also this one looks 3 times faster. I found it here ### solution 5 # def method_searchsort(array, keys, values): sort_idx = np.argsort(keys) idx = np.searchsorted(keys,array,sorter = sort_idx) return values[sort_idx][idx] t5_1 = time.time() R5 = method_searchsort(GP[:, 0], M[:,0], M[:,1]) t5_2 = time.time() RE: Finding a way to avoid loops use - paul18fr - Apr-15-2023 Thanks Gribouillis for the effort, I tested it on the laptop with about 700 millions of rows, and as you said, it's almost 3 times faster; let me analysing it now Paul RE: Finding a way to avoid loops use - paul18fr - Apr-16-2023 Hi Ok I guess I figured out how it works; in my case, I do not use the function and the code leads to : import numpy as np M = np.array([[ 1, 2], [ 2, 5], [ 3, 7], [ 4, 2], [ 50, 5], [ 6, 11], [ 70, 5], [ 8, 11], [ 100, 2]] ) GP = np.array([[ 1, 1, 100], [ 1, 2, 100], [ 1, 5, 100], [ 1, 3, 100], [ 1, 4, 100], [ 3, 1, 100], [ 3, 2, 100], [ 3, 3, 100], [ 3, 4, 100], [ 70, 3, 90], [ 70, 4, 90], [ 70, 2, 92], [ 70, 2, 93], [ 6, 1, 100], [ 6, 8, 100], [ 6, 5, 100], [ 6, 4, 100], [ 6, 3, 100], [ 6, 7, 100], [ 6, 2, 100], [ 6, 6, 100]] ) sort_idx = np.argsort(M[:,0]) print(f"sort_idx = {sort_idx}") idx = np.searchsorted(M[:,0], GP[:, 0], sorter = sort_idx) print(f"idx = {idx}") ################## R5 = M[:,1][sort_idx][idx] # another notation R6 = M[sort_idx[idx], 1] Check = np.array_equal(R5, R6) print(f"Are R5 and R6 identical : {Check}")However I've been a bit surprised on how "indexes" are used in R5 (R6 is more natural for me); it's not clear for me how it works => can I have a short explanation or a link whire to go to? (I looked for "array indexing" but I've not seen something similar) Thanks Paul RE: Finding a way to avoid loops use - Gribouillis - Apr-16-2023 (Apr-16-2023, 10:00 AM)paul18fr Wrote: I looked for "array indexing" but I've not seen something similarLook for "numpy cheat sheet" perhaps. |