Python Forum
Slow Python Code
#1
Hi all,

I have two lists, data and data2. For each value in data, I want to find where it occurs in data2 and append the three values that follow it in data2 (i.e. at index+1, index+2 and index+3) to a new list.

The length of data2 is 2364004 and data is 5131.

I have tried the following code which works but takes a long time to run.

[data3.extend([data2[index+1],data2[index+2],data2[index+3]]) for i in data for index, j in enumerate(data2) if i==j]
Is there a more efficient/faster way of doing this? Maybe using a numpy array function if need be or something else?

I can post the raw data if that helps

Thanks!
#2
You could implement a list comprehension instead of list.extend().

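data3 = [data2[index+k] for i in data for index, j in enumerate(data2) if i==j for k in (1, 2, 3)]
List comprehensions are generally faster, and you wouldn't be issuing a list.extend() call for every match. Note that this still compares every item of data with every item of data2, so the gain is modest.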
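A different technique that this post does not mention is to put data into a set, so that each membership test takes roughly constant time and data2 is scanned only once. The sketch below uses small made-up lists and a hypothetical helper name, following_values. It assumes the values are hashable, that the output order may follow data2 rather than data, and that matches in the last three positions of data2 can be skipped.

```python
def following_values(data, data2, width=3):
    """Collect the `width` values that follow each item of data2 found in data."""
    wanted = set(data)  # set lookups take roughly constant time on average
    result = []
    # Stop early enough that data2[i + width] always exists.
    for i in range(len(data2) - width):
        if data2[i] in wanted:
            result.extend(data2[i + 1:i + 1 + width])
    return result


# Small made-up inputs, not the poster's real data.
data = [5, 2]
data2 = [1, 2, 3, 4, 5, 6, 7, 8]
data3 = following_values(data, data2)
print(data3)
```

Unlike the nested loop, duplicates in data produce only one set of appended values per match in data2, so check that this is the behaviour you want.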

You could also sort data and use the bisect module to check each value of data2 against it. Bisecting works by repeatedly halving the sorted list until it finds the position where a value belongs. Instead of comparing each item of data2 against all 5131 items in data, you would need about log2(5131), roughly 13, comparisons per item.

This *should* work but may need some fine tuning. Again, data needs to be sorted first; Python's built-in sorted() or list.sort() should do the trick.

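import bisect

data = sorted(data)  # bisect only works on a sorted list
data3 = []
for i, x in enumerate(data2):
    # leftmost position in data where x could be inserted
    index = bisect.bisect_left(data, x)
    # walk over every entry of data equal to x (handles duplicates in data)
    while index < len(data) and data[index] == x:
        # raises IndexError if a match sits in the last three items of data2
        data3.extend([data2[i+1],data2[i+2],data2[i+3]])
        index += 1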
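To check the bisect idea on something small, here is a self-contained sketch with made-up lists. The helper name contains_sorted is hypothetical, and the loop skips the last three items of data2 so that the three following values always exist.

```python
import bisect
import math


def contains_sorted(sorted_items, x):
    """Binary-search membership test on an already sorted list."""
    pos = bisect.bisect_left(sorted_items, x)
    return pos < len(sorted_items) and sorted_items[pos] == x


# Small made-up inputs, not the poster's real data.
data = [7, 2, 9]
data2 = [2, 10, 11, 12, 9, 20, 21, 22, 5]

sorted_data = sorted(data)
bisect_result = []
for i, x in enumerate(data2[:-3]):
    if contains_sorted(sorted_data, x):
        bisect_result.extend(data2[i + 1:i + 4])

# Brute-force reference to compare against.
brute_result = []
for i, x in enumerate(data2[:-3]):
    if x in data:
        brute_result.extend(data2[i + 1:i + 4])

# Upper bound on halving steps for a sorted list of 5131 items.
steps = math.ceil(math.log2(5131))
print(bisect_result, bisect_result == brute_result, steps)
```

The search runs on the shorter list (data), so the number of halving steps depends on its length, not on the 2.3 million items in data2.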
#3
Another way is to use Pandas:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'x':[1,2.3,3.2,3.3,5,6,7,8,6.2,7]}, dtype=float)
df2 = pd.DataFrame({'y': list(range(100))}, dtype=float)
s = df2.y.isin(df1.x)  # True where a value of df2.y also occurs in df1.x
# pd.np was removed in later pandas versions, so call NumPy directly
np.hstack([df2[s].values, df2[s.shift(1, fill_value=False)].values, df2[s.shift(2, fill_value=False)].values])
Output:
array([[ 1., 2., 3.], [ 5., 6., 7.], [ 6., 7., 8.], [ 7., 8., 9.], [ 8., 9., 10.]])
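Note: you need to explicitly declare data types for the dataframes if you plan to use the .isin method. .isin performs some data type casting (see bug) that can lead to unexpected results, e.g. when comparing arrays of float and integer values. This bug was still present in pandas v0.25.0. Also note that each output row holds the matched value and the two values after it; use shifts of 1, 2 and 3 to get the three values that follow the match.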
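Since the original question asked about NumPy, the same isin idea can be sketched without pandas. This is only an illustration on the sample values from this post, not the poster's real data. It assumes both arrays share one dtype and that matches too close to the end of data2 can be dropped.

```python
import numpy as np

# Sample values from the pandas example above, both cast to float.
data = np.array([1, 2.3, 3.2, 3.3, 5, 6, 7, 8, 6.2, 7], dtype=float)
data2 = np.arange(100, dtype=float)

# Positions in data2 whose value also occurs in data.
hits = np.flatnonzero(np.isin(data2, data))
# Keep only hits that still have three values after them.
hits = hits[hits + 3 < len(data2)]
# Broadcasting builds one row of indices (i+1, i+2, i+3) per hit.
rows = data2[hits[:, None] + np.arange(1, 4)]
# Flatten to match the flat list the question builds with extend().
data3 = rows.ravel()
print(rows)
```

Here rows has one line per match, holding the three values that follow it, and data3 is the flattened version in data2 order.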
#4
Thanks @scidam and @stullis, you both provide good avenues to explore.