Python/Numpy have I already written the swiftest code for large array?
#1
**GOAL:**
I would like to get my script's total execution time down from 4 minutes to under 30 seconds. I have a large 1-D array (3,000,000+ elements) of distances with many duplicate values, and I am trying to write the fastest function that returns all distances appearing n times in the array. I have written a NumPy version, but one line is a bottleneck. Performance matters because the calculation runs in a for loop over 2400 different large distance arrays.

import numpy as np

for t in range(2400):
    # 3,000,000 random distances between 1e9 and 5e9 (a stand-in for the real data);
    # dtype=np.int64 so values above 2**31 fit
    a = np.random.randint(1000000000, 5000000000, 3000000, dtype=np.int64)
    # Count occurrences of every integer value from 0 up to max(a)
    b = np.bincount(a, minlength=np.size(a))
    c = np.where(b == 3)[0]  # SLOW STATEMENT/BOTTLENECK

**EXPECTED RESULTS:**
Given a 1-D array of distances [2000000000, 3005670000, 2000000000, 12345667, 4000789000, 12345687, 12345667, 2000000000, 12345667],
I would expect back the array [2000000000, 12345667] when I ask for all distances that appear 3 times in the main array.
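
For clarity, here is a naive, unoptimized reference of what I mean (a plain-Python sketch; the helper name appears_n_times is only for illustration):

from collections import Counter

def appears_n_times(distances, n=3):
    # Count every distance, then keep the ones seen exactly n times
    counts = Counter(distances)
    return [d for d, c in counts.items() if c == n]

print(appears_n_times([2000000000, 3005670000, 2000000000, 12345667,
                       4000789000, 12345687, 12345667, 2000000000, 12345667]))
# -> [2000000000, 12345667]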

What should I do?
#2
The issue is that b has a length of np.amax(a) + 1, which here is somewhere between 1 and 5 billion. Scanning an array that long to find the 3's takes a lot of time. I suggest building a dictionary with the values of a as keys and the number of occurrences as values; then you can retrieve all the keys whose count is 3. Something like this (not tested!):

# Create a dictionary with elements of a
# and the number of times they appear
mydict = {}
for e in a:
    if e in mydict:
        mydict[e] += 1
    else:
        mydict[e] = 1

# Select keys which have the value 3
c = [k for k, v in mydict.items() if v == 3]
So instead of scanning an array of several billion counters, you create a dictionary of roughly 3 million entries; even allowing a pessimistic ~22 operations per key access (2**22 > 3,000,000), plus one more pass to produce the final output, that comes to well under 100 million operations. More importantly, the execution time now depends on the number of elements in a, not on the largest value in a.
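
If you prefer to stay in NumPy, a vectorized variant of the same idea (just a sketch, separate from the dictionary approach above; the helper name appears_n_times_np is only illustrative) is np.unique with return_counts=True, whose memory use scales with the number of distinct distances rather than with the largest value:

import numpy as np

def appears_n_times_np(a, n=3):
    # Unique distances and how often each occurs
    values, counts = np.unique(a, return_counts=True)
    return values[counts == n]

np.unique sorts the array, so this costs O(len(a) log len(a)), but it runs entirely in compiled code, which in practice tends to beat a pure-Python loop over 3 million elements. Either way, Counter from the collections module can replace the hand-written dictionary loop above with the same complexity.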

Let us know your final code and the results of your tests!