##### how to retrieve sets of data
paul18fr, Jul-16-2024, 08:56 AM (This post was last modified: Jul-16-2024, 09:35 AM by paul18fr.)

Hi,

As one can see in the picture, I'm trying to find sets of data (V array: 1 row = 1 set) inside the M array. Note that the order of the data matters, i.e. the values in yellow are the expected ones, not the ones in green.

I cannot use `np.intersect1d` since it looks for single values one at a time, not for a set.

Of course the current example has been simplified; in the real world I'm dealing with millions of rows for M and thousands for V, so performance is a key point.

Any hint is welcome.

Thanks for your time

Paul

```
import numpy as np

M = np.array([[3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470],
              [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474],
              [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478],
              [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482],
              [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486],
              [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469],
              [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]])

V = np.array([[3469, 3470],
              [3471, 3472],
              [3479, 3480]])

val, M_ind, V_ind = np.intersect1d(M[:, 1::], V, assume_unique=False, return_indices=True)
```

Attached Files: Thumbnail(s)

paul18fr, Jul-16-2024, 02:01 PM (This post was last modified: Jul-16-2024, 02:01 PM by paul18fr.)

The only way I've found so far remains too slow, so I'm still looking for a NumPy solution. Because it uses fewer type conversions, solution 2 is a bit faster than solution 1.
```
# -*- coding: utf-8 -*-
import numpy as np
import time

M = np.array([[3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470],
              [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474],
              [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478],
              [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482],
              [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486],
              [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469],
              [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]])

# n1 = 100_000 and n2 = 100 give the M[700000, 9] / V[300, 2] sizes timed below
n1 = 1
M = np.repeat(M, n1, axis=0)  # each row is repeated n1 times

V = np.array([[3469, 3470],
              [3471, 3472],
              [3479, 3480]])
n2 = 1
V = np.repeat(V, n2, axis=0)

rM, cM = np.shape(M)
rV, cV = np.shape(V)
print(f"Number of iteration = {rM * rV}")

#########################
## solution 1 with tuples
def Intersect2D_1(Array_A, Array_B):

    def MatchWithTuples(A, B):
        # A is always a tuple; the filtered list is converted to a tuple
        # the list comprehension keeps only the values of B that are in A
        return A == tuple([b for b in B if b in A])

    ResultsList = [True if MatchWithTuples(A=y, B=x) else False
                   for x in Array_A for y in Array_B]
    return ResultsList

t0 = time.time()
M1 = tuple(map(tuple, M[:, 1::]))
V1 = tuple(map(tuple, V))
Result1 = np.asarray(Intersect2D_1(Array_A=M1, Array_B=V1))
Result1_reshaped = Result1.reshape(rM, rV)
Ind1 = np.unique(np.where(Result1_reshaped == True)[0])
t1 = time.time()
print(f"Solution #1: Duration for M[{rM}, {cM}] and V[{rV}, {cV}] = {t1 - t0}")
del M1, V1, Result1

########################
## solution 2 with lists
def Intersect2D_2(Array_A, Array_B):

    def MatchWithLists(A, B):
        # A is always a list
        # the list comprehension keeps only the values of B that are in A
        return A == [b for b in B if b in A]

    ResultsList = [True if MatchWithLists(A=y, B=x) else False
                   for x in Array_A for y in Array_B]
    return ResultsList

t2 = time.time()
M2 = M[:, 1::].tolist()
V2 = V.tolist()
Result2 = np.asarray(Intersect2D_2(Array_A=M2, Array_B=V2))
Result2_reshaped = Result2.reshape(rM, rV)
Ind2 = np.unique(np.where(Result2_reshaped == True)[0])
t3 = time.time()
print(f"Solution #2: Duration for M[{rM}, {cM}] and V[{rV}, {cV}] = {t3 - t2}")
del M2, V2, Result2

InvolvedElements = M[Ind2, 0]
print(f"\nInvolved row indexes = {Ind2}")
print(f"Involved elements = {InvolvedElements}")
```

Output:
```
Number of iteration = 210000000
Solution #1: Duration for M[700000, 9] and V[300, 2] = 167.9941966533661
Solution #2: Duration for M[700000, 9] and V[300, 2] = 157.40234184265137

Involved row indexes = [ 0 1 2 ... 499997 499998 499999]
```

paul18fr, Jul-17-2024, 12:20 PM

Ahhh, I forgot a key point when playing with lists and tuples (see the `MatchWithLists` and `MatchWithTuples` functions): [1, 2, 3] and [1, 3] give the same result when I'm looking for the exact set [1, 3], which is wrong!

(Posts: 6,477 Threads: 18 Joined: Feb 2020), Jul-17-2024, 01:49 PM

I have no clue what criteria are used to paint a cell yellow or green. Can you explain?

paul18fr, Jul-17-2024, 02:05 PM (This post was last modified: Jul-17-2024, 02:06 PM by paul18fr.)

Hi

I'm looking for sets of values in V that match exactly in M, without any cell in between; the order in V must be respected:

correct order = yellow cells
opposite order = wrong = green cells

In the first picture, the yellow cells were missing. In the new picture, all colored cells except the green (and white) ones are the target.

Expected output:

Output:
```
Number of iteration = 21
Solution #2: Duration for M[7, 9] and V[3, 2] = 0.0009970664978027344

Involved row indexes = [0 1 2 3 4]
Involved elements = [3301 3302 3303 3304 3305]
```

Attached Files: Thumbnail(s)

paul18fr, Jul-17-2024, 02:20 PM (This post was last modified: Jul-17-2024, 03:33 PM by paul18fr.)

If I manually swap 2 cells in M[3, :], then index 3 becomes invalid (see new picture). I've found cases that invalidate this hypothesis => V[:, 0] must be the first one found! I'm dealing with the column position as well (diff = 1), but the code becomes ugly and even slower. I feel there's a better way to proceed.

```
M = np.array([[3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470],
              [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474],
              [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478],
              [3304, 3475, 3476, 3477, 3478, 3479, 3481, 3480, 3482],
              [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486],
              [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469],
              [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]])
```

Attached Files: Thumbnail(s)

Pedroski55, Jul-18-2024, 07:47 AM (This post was last modified: Jul-18-2024, 07:47 AM by Pedroski55.)

I am not familiar with numpy. But you can do what you want like this:

```
import numpy as np

M = np.array([[3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470],
              [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474],
              [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478],
              [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482],
              [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486],
              [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469],
              [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]])

V = np.array([[3469, 3470],
              [3471, 3472],
              [3479, 3480]])

# if a number can appear more than 1 time in a row, things are more complicated
# assume a number is only present 1 time in a row for now
# assume for now that corresponding pairs of numbers can only be in the same row
# not the end of 1 row and the beginning of the next row
def checkrow(Mrow, rowV, rownum):
    print(f'This is row {rownum}')
    print(f'checking for sequence {rowV} in {Mrow}')
    # case rowV[0] not in Mrow
    if Mrow.count(rowV[0]) == 0:
        return False
    # case rowV[0] is the last element in Mrow
    elif Mrow[Mrow.index(rowV[0])] == Mrow[-1]:
        return False
    # case rowV[0] in Mrow: check whether it is followed by rowV[1]
    else:
        index = Mrow.index(rowV[0])
        if Mrow[index + 1] == rowV[1]:
            print('Found a match!')
            print(f'start index = {index}, values are {Mrow[index], Mrow[index + 1]}')
            return True
        return False

count = 0
for m in M:
    # .tolist() yields plain Python ints, so the printed rows match the output below
    rowM = m.tolist()
    for v in V:
        rowV = v.tolist()
        res = checkrow(rowM, rowV, count)
    count += 1
```

Gives:

Output:
```
This is row 0
checking for sequence [3469, 3470] in [3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470]
Found a match!
start index = 7, values are (3469, 3470)
This is row 0
checking for sequence [3471, 3472] in [3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470]
This is row 0
checking for sequence [3479, 3480] in [3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470]
This is row 1
checking for sequence [3469, 3470] in [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474]
Found a match!
start index = 3, values are (3469, 3470)
This is row 1
checking for sequence [3471, 3472] in [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474]
Found a match!
start index = 5, values are (3471, 3472)
This is row 1
checking for sequence [3479, 3480] in [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474]
This is row 2
checking for sequence [3469, 3470] in [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478]
This is row 2
checking for sequence [3471, 3472] in [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478]
Found a match!
start index = 1, values are (3471, 3472)
This is row 2
checking for sequence [3479, 3480] in [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478]
This is row 3
checking for sequence [3469, 3470] in [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482]
This is row 3
checking for sequence [3471, 3472] in [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482]
This is row 3
checking for sequence [3479, 3480] in [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482]
Found a match!
start index = 5, values are (3479, 3480)
This is row 4
checking for sequence [3469, 3470] in [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486]
This is row 4
checking for sequence [3471, 3472] in [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486]
This is row 4
checking for sequence [3479, 3480] in [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486]
Found a match!
start index = 1, values are (3479, 3480)
This is row 5
checking for sequence [3469, 3470] in [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469]
This is row 5
checking for sequence [3471, 3472] in [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469]
This is row 5
checking for sequence [3479, 3480] in [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469]
This is row 6
checking for sequence [3469, 3470] in [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]
This is row 6
checking for sequence [3471, 3472] in [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]
This is row 6
checking for sequence [3479, 3480] in [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]
```

Vito, Jul-28-2024, 08:17 PM

In set theory, intersection removes duplicates, which is what this operator does. Make it simpler.

```
mask = np.in1d(M, V)
print(mask)
```

Output:
```
[False False False False False False False True True False False False True True True True False False False True True False False False False False False False False False False False True True False False False True True False False False False False False False False False False False False False True True False False False False False True True False False]
```
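
An element-wise membership mask says which values of M appear somewhere in V, but not whether a V row appears as adjacent cells in the right order. One possible vectorized direction is sketched below. It rests on assumptions that are not confirmed in the thread: every row of V has exactly two columns, all values are non-negative integers, and the square of (largest value + 1) fits in a 64-bit integer. The function name `rows_with_ordered_pairs` is hypothetical. The idea is to encode each adjacent pair of cells as one integer key, so that `np.isin` compares whole ordered pairs rather than single values.

```python
import numpy as np

def rows_with_ordered_pairs(M, V):
    """Return indexes of the rows of M that contain at least one row of V
    as two adjacent cells in the same order (column 0 of M is skipped).

    Assumptions (not from the thread): V has exactly 2 columns, values are
    non-negative integers, and (max value + 1) ** 2 fits in int64.
    """
    data = M[:, 1:].astype(np.int64)      # column 0 holds the element ID
    pairs = V.astype(np.int64)
    base = max(data.max(), pairs.max()) + 1
    # one key per adjacent (left, right) pair; the key is order-sensitive
    window_keys = data[:, :-1] * base + data[:, 1:]
    v_keys = pairs[:, 0] * base + pairs[:, 1]
    hits = np.isin(window_keys, v_keys)   # same shape as window_keys
    return np.flatnonzero(hits.any(axis=1))

M = np.array([[3301, 947, 898, 899, 945, 3467, 3468, 3469, 3470],
              [3302, 3467, 3468, 3469, 3470, 3471, 3472, 3473, 3474],
              [3303, 3471, 3472, 3473, 3474, 3475, 3476, 3477, 3478],
              [3304, 3475, 3476, 3477, 3478, 3479, 3480, 3481, 3482],
              [3305, 3479, 3480, 3481, 3482, 3483, 3484, 3485, 3486],
              [4301, 947, 898, 899, 945, 3467, 3468, 3470, 3469],
              [4304, 3475, 3476, 3477, 3478, 3480, 3479, 3481, 3482]])

V = np.array([[3469, 3470],
              [3471, 3472],
              [3479, 3480]])

rows = rows_with_ordered_pairs(M, V)
print(f"Involved row indexes = {rows}")        # [0 1 2 3 4]
print(f"Involved elements = {M[rows, 0]}")     # [3301 3302 3303 3304 3305]
```

With the modified M from the Jul-17 post (3480 and 3481 swapped in row 3), the same call returns rows 0, 1, 2 and 4. The sketch contains no Python-level loop over rows, but it has not been timed on the sizes mentioned in the thread. For V rows wider than two values, the key encoding would need to be generalized.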
