Python Forum
Python/Numpy have I already written the swiftest code for large array? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Python/Numpy have I already written the swiftest code for large array? (/thread-6891.html)



Python/Numpy have I already written the swiftest code for large array? - justforkicks1 - Dec-12-2017

**GOAL:**
I would like to get my script's total execution time down from 4 minutes to less than 30 seconds. I have a large 1-d array (3,000,000+ elements) of distances with many duplicate distances. I am trying to write the swiftest function that returns all distances that appear n times in the array. I have written a function in numpy, but there is a bottleneck at one line of the code. Performance matters because the calculation runs in a for loop over 2400 different large distance arrays.

Quote:
import numpy as np

def find_triplicates():
    for t in range(0, 2400):
        a = np.random.randint(1000000000, 5000000000, 3000000)
        b = np.bincount(a, minlength=np.size(a))
        c = np.where(b == 3)[0]  # SLOW STATEMENT/BOTTLENECK
        return c

**EXPECTED RESULTS:**
Given a 1d array of distances [2000000000,3005670000,2000000000,12345667,4000789000,12345687,12345667,2000000000,12345667]
I would expect back an array of [2000000000,12345667] when queried to return an array of all distances that appear 3 times in the main array.
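For reference, the expected result above can be reproduced with np.unique and its return_counts option (available in NumPy 1.9+); this is only a sketch of the desired behavior on the sample array, not a benchmark:

```python
import numpy as np

# Sample distances from the question
a = np.array([2000000000, 3005670000, 2000000000, 12345667,
              4000789000, 12345687, 12345667, 2000000000, 12345667])

# values: the unique distances (sorted); counts: how often each occurs
values, counts = np.unique(a, return_counts=True)

# Keep only the distances that appear exactly 3 times
c = values[counts == 3]
print(c)  # the two distances that occur 3 times, in sorted order
```

Note that np.unique returns the unique values in sorted order, so the result comes back as [12345667, 2000000000].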

What should I do?


RE: Python/Numpy have I already written the swiftest code for large array? - squenson - Dec-12-2017

The issue is that b has a length equal to np.amax(a) + 1, which here is somewhere between 1 and 5 billion. Allocating and scanning an array that long to find the 3's takes a lot of time. I suggest that you build a dictionary with the values of a as keys and the number of occurrences as values; then you can retrieve all the keys which have the value 3. Something like this (not tested!):

# Create a dictionary with elements of a
# and the number of times they appear
mydict = {}
for e in a:
    if e in mydict:
        mydict[e] += 1
    else:
        mydict[e] = 1

# Select keys which have the value 3
c = [k for k, v in mydict.items() if v == 3]
So instead of scanning an array of several billion entries, you create a dictionary of about 3 million entries, with at most about 22 operations to access a single key (2**22 > 3,000,000), plus one more pass to produce the final output: fewer than 100 million operations in total. More importantly, the execution time now depends on the number of elements in a, not on the largest value in a.
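As a sketch of the same idea, the counting loop can be written more compactly with collections.Counter from the standard library; shown here on the sample array from the question, not tested on the full 3-million-element case:

```python
from collections import Counter

import numpy as np

# Sample distances from the original question
a = np.array([2000000000, 3005670000, 2000000000, 12345667,
              4000789000, 12345687, 12345667, 2000000000, 12345667])

# Counter builds the {value: occurrences} dictionary in one call
counts = Counter(a.tolist())

# Select the distances which appear exactly 3 times
c = [k for k, v in counts.items() if v == 3]
print(c)
```

On Python 3.7+ the result preserves first-seen order, so this prints [2000000000, 12345667]. Like the dictionary version, the cost scales with the number of elements in a, not with the largest value.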

Let us know your final code and the results of your tests!