Python Forum

Full Version: Slow Python Code
Hi all,

I have two lists, data and data2. For every value in data that also appears in data2, I want to take the 3 values that follow it in data2 (i.e. index +1, +2 and +3) and append them to a new list.

The length of data2 is 2364004 and data is 5131.

I have tried the following code which works but takes a long time to run.

[data3.extend([data2[index+1], data2[index+2], data2[index+3]]) for i in data for index, j in enumerate(data2) if i == j]
Is there a more efficient/faster way of doing this? Maybe using a numpy array function if need be or something else?

I can post the raw data if that helps.

Thanks!
You could implement a list comprehension instead of list.extend().

data3 = [data2[index + k] for i in data for index, j in enumerate(data2) if i == j for k in (1, 2, 3)]
List comprehensions are usually a little faster because you aren't issuing a method call for every match, although this still compares every item in data with every item in data2.
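As a quick sanity check, here is a small sketch showing that explicit loops with list.extend() and a flat list comprehension build the same list. The toy lists are made up for illustration and are not the poster's data.

```python
# Toy stand-ins for the poster's lists (hypothetical values).
data = [5, 9]
data2 = [1, 5, 10, 11, 12, 9, 20, 21, 22]

# Explicit loops with list.extend, as described in the question.
with_extend = []
for i in data:
    for index, j in enumerate(data2):
        if i == j:
            with_extend.extend([data2[index + 1], data2[index + 2], data2[index + 3]])

# Flat comprehension: the trailing "for k" clause emits the three neighbours.
with_comprehension = [
    data2[index + k]
    for i in data
    for index, j in enumerate(data2)
    if i == j
    for k in (1, 2, 3)
]

print(with_extend == with_comprehension)  # True
```

Both versions still perform len(data) * len(data2) comparisons, which is roughly 1.2e10 for the sizes quoted in the question, so the comprehension alone is unlikely to change the running time much.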

You could also sort data and use the bisect module to look up each value by binary search. Bisecting works by dividing the sorted list in half over and over again until it reaches the position where the value belongs. Instead of comparing each item in data2 against all 5131 items in data, you'd need about log2(5131) comparisons per item, only 13 or so!

This *should* work but may need some fine tuning. Again, data needs to be sorted first. The built-in sorted() or list.sort() should be able to do the trick.

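import bisect

data = sorted(data)  # bisect requires a sorted list
data3 = []

for i, x in enumerate(data2):
    # Leftmost position in data where x would be inserted
    index = bisect.bisect_left(data, x)
    # Step over every entry of data equal to x (duplicates included),
    # without copying the tail of the list on each lookup
    while index < len(data) and data[index] == x:
        data3.extend([data2[i+1], data2[i+2], data2[i+3]])
        index += 1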
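To make the bisect idea concrete, here is a minimal self-contained sketch on toy data. The function name `neighbours_after_matches` and the sample lists are assumptions for illustration. It also skips matches that have fewer than 3 values after them instead of raising IndexError, which is a design choice not discussed in this thread.

```python
import bisect
import math


def neighbours_after_matches(data, data2):
    """Collect the 3 values following each item of data2 that occurs in data.

    Hypothetical helper: the name and the end-of-list handling are assumptions.
    """
    lookup = sorted(data)  # bisect only works on a sorted sequence
    result = []
    for i, x in enumerate(data2):
        if i + 3 >= len(data2):
            break  # fewer than 3 values remain after position i
        pos = bisect.bisect_left(lookup, x)  # binary search for x
        # Step over every copy of x in lookup (handles duplicates in data)
        while pos < len(lookup) and lookup[pos] == x:
            result.extend(data2[i + 1:i + 4])
            pos += 1
    return result


data = [9, 5]
data2 = [1, 5, 10, 11, 12, 9, 20, 21, 22, 5]
print(neighbours_after_matches(data, data2))  # [10, 11, 12, 20, 21, 22]

# Rough number of halving steps needed to search a 5131-item sorted list
print(math.ceil(math.log2(5131)))  # 13
```

Note that the results come out in data2 order rather than data order, so the list may be arranged differently from the nested-loop version even though it holds the same values.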
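Another way is to use Pandas. Note that this example returns each matched value together with the two values after it; to get the three values that follow a match, shift the mask by 1, 2 and 3 instead: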

import numpy as np  # pd.np is no longer available in recent pandas versions
import pandas as pd
df1 = pd.DataFrame({'x':[1,2.3,3.2,3.3,5,6,7,8,6.2,7]}, dtype=float)
df2 = pd.DataFrame({'y': list(range(100))}, dtype=float)
s = df2.y.isin(df1.x)  # boolean mask: True where a value of y occurs in x
np.hstack([df2[s].values, df2[s.shift(1, fill_value=False)].values, df2[s.shift(2, fill_value=False)].values])
Output:
array([[ 1.,  2.,  3.],
       [ 5.,  6.,  7.],
       [ 6.,  7.,  8.],
       [ 7.,  8.,  9.],
       [ 8.,  9., 10.]])
Note: you need to explicitly declare data types for the dataframes if you are planning to use the .isin method. .isin performs some data type casting (see bug) that can lead to unexpected results, e.g. when comparing arrays of float and integer values. This bug is still present in Pandas v0.25.0.
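Since the question asked about numpy, here is a rough sketch of the same idea with plain numpy arrays. It is not taken from the replies above: the toy arrays are made up, and dropping matches that have fewer than 3 values after them is an assumption about the desired behaviour.

```python
import numpy as np

# Toy stand-ins for the poster's lists (hypothetical values).
data = np.array([5, 9])
data2 = np.array([1, 5, 10, 11, 12, 9, 20, 21, 22, 5])

# Positions in data2 whose value occurs anywhere in data
hits = np.flatnonzero(np.isin(data2, data))
# Drop hits that do not have 3 values after them (assumed behaviour)
hits = hits[hits + 3 < len(data2)]

# Broadcasting builds an index grid: each row is (hit+1, hit+2, hit+3)
rows = data2[hits[:, None] + np.arange(1, 4)]
data3 = rows.ravel()  # flatten to match the list built with extend

print(data3.tolist())  # [10, 11, 12, 20, 21, 22]
```

Unlike the nested loops, this counts each position in data2 once even if data contains duplicates, and the results follow data2 order.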
Thanks @scidam and stullis, both provide good avenues to explore.