Python Forum

Full Version: fast lookup for array
Hi all,

I have a large array (+1M values) where for each value I want to lookup something in a table.
What would be an efficient way to do this?

For instance, as a small example:
import pandas as pd

A = pd.DataFrame([67, 67, 67, 67, 68, 69, 69, 69, 70, 70])
# Note: building the table from a mixed-type np.array would coerce the
# integer keys to strings; a dict of columns keeps the dtypes intact.
Table = pd.DataFrame({'Index': [67, 68, 69, 70], 'Item': ['a', 'b', 'c', 'd']})
Result = ['a','a','a','a','b','c','c','c','d','d']
How do I best get to the desired result, especially when A is very large?

Thanks,
The DataFrame.isin and Series.isin methods are efficient. Have you tried them?
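A small sketch of what isin does, for reference: it returns a boolean membership mask rather than the mapped values themselves, so it answers "is this value in the table?" (the values here are made up for illustration).

```python
import pandas as pd

# Series.isin marks which elements appear in the given collection
a = pd.Series([67, 67, 68, 69, 70, 71])
mask = a.isin([67, 68, 69, 70])
print(mask.tolist())  # [True, True, True, True, True, False]
```

This is useful for filtering A down to keys the table actually contains, but mapping each value to its table entry needs a different operation.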
Hash tables are good for fast lookups, but they use more memory than other primitive data structures.
In Python you can use a dict, which is implemented as a hash table.
If you use Python 3.6+, insertion order is also preserved, which is uncommon for hash tables.
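A minimal sketch of the dict approach applied to the example from the question: build the table once as a dict, then look each value up with average O(1) cost per element.

```python
# Lookup table as a plain dict (hash table): key -> item
table = {67: 'a', 68: 'b', 69: 'c', 70: 'd'}

values = [67, 67, 67, 67, 68, 69, 69, 69, 70, 70]

# Each dict access is O(1) on average, so the whole pass is O(N)
result = [table[v] for v in values]
print(result)  # ['a', 'a', 'a', 'a', 'b', 'c', 'c', 'c', 'd', 'd']
```

If some values might be missing from the table, table.get(v) (or table.get(v, default)) avoids a KeyError.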

I don't know pandas or its implementation very well. If pandas uses a hash-based lookup, then it's fast.
If not, then it's slow.

How big is the table you want to lookup?
In general, pandas invokes the underlying numpy lookup engine, e.g. df.loc[:, 'somecolumn'] == 'somevalue'
is almost equivalent (internally) to df.loc[:, 'somecolumn'].values == 'somevalue', where xxx.values points to the corresponding numpy array. This lookup has O(N) time complexity, but it is quite fast, since it is implemented in C. At the same time, pandas index-based lookups use hash tables under the hood, so looking up rows by index is much faster (if N is large) than looking them up by column value.
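To illustrate the index-based route on the question's example: put the lookup keys into the index of a Series and use Series.map, which resolves each value through the hash-backed index rather than scanning a column.

```python
import pandas as pd

a = pd.Series([67, 67, 67, 67, 68, 69, 69, 69, 70, 70])

# Lookup table as a Series: keys in the index, items as values
table = pd.Series(['a', 'b', 'c', 'd'], index=[67, 68, 69, 70])

# Series.map with a Series argument looks values up via the index
result = a.map(table)
print(result.tolist())  # ['a', 'a', 'a', 'a', 'b', 'c', 'c', 'c', 'd', 'd']
```

Keys absent from the table come back as NaN, which also makes missing entries easy to spot afterwards.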