Speeding up iterating over rows
#1
I cannot figure out how to speed up the following problem by vectorising.

I have a big (millions of rows) pandas DataFrame of prices over time. I need to set a signal of +1 if the price goes up by at least $1, and -1 if the price goes down by at least $1. I store these signals in a column named Signal.

I am not comparing the current row to the previous row only; if I were, this would be easy to do with np.where and df.shift(), as in the sketch below. Instead, I start with the price in the first row and iterate through rows until the price has moved by at least $1 in either direction relative to that starting price. The row where that happens becomes my new starting point, I compare subsequent rows against it, and so on.
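For reference, the easy row-to-row version would be something like this (a sketch; the column name Price is assumed):

import numpy as np
import pandas as pd

# Toy data; my real frame has millions of rows.
df = pd.DataFrame({"Price": [100.0, 100.4, 101.1, 100.9, 99.8, 100.05]})

# Each row compared to the previous row only -- NOT what I need.
diff = df["Price"] - df["Price"].shift()
df["Signal"] = np.where(diff >= 1, 1, np.where(diff <= -1, -1, 0))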

At the moment I use iterrows() with two if statements (one for the price going up by $1, one for it going down) and a variable that holds the price of the row I am comparing against. However, this is very slow.
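Stripped down, my current loop looks roughly like this:

import pandas as pd

def slow_signals(df: pd.DataFrame) -> pd.DataFrame:
    # iterrows() plus a variable holding the reference price -- very slow.
    df["Signal"] = 0
    start = df["Price"].iloc[0]          # price of the row we compare against
    for idx, row in df.iterrows():
        if row["Price"] >= start + 1:    # moved up by at least $1
            df.loc[idx, "Signal"] = 1
            start = row["Price"]         # this row is the new starting point
        elif row["Price"] <= start - 1:  # moved down by at least $1
            df.loc[idx, "Signal"] = -1
            start = row["Price"]
    return df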

See attached image for an example.

#2
(Nov-22-2022, 01:22 PM)NeoXx Wrote: At the moment I use iterrows() with two if statements (one for the price going up by $1, one for it going down) and a variable that holds the price of the row I am comparing against. However, this is very slow.
Look at How To Make Your Pandas Loop 71803 Times Faster
So vectorization is the way to go to make this faster.
Quote: We use the advantages of vectorization to create really fast code. The point is to avoid Python-level loops like in the examples before [1] and to use optimized C code which uses the memory much more efficiently.
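For illustration only (a toy example, not a solution to the $1-move problem), here is the same per-row difference computed both ways:

import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(0).normal(100, 1, 100_000))

# Python-level loop: one interpreter round-trip per element.
slow = [s.iloc[i] - s.iloc[i - 1] for i in range(1, len(s))]

# Vectorized: a single call into optimized C code over the whole array.
fast = s.diff()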
#3
I believe NeoXx already knows that vectorization is faster than looping. The question is HOW to vectorize this particular problem. Essentially NeoXx wants to create price bins that are at least $1 wide, with a column named "Signal" holding a +/-1 value at the start of each bin.

I haven't looked into it too much, but this does not really fit the usual description of "binning": NeoXx's bins contain only consecutive rows, and each bin's edges depend on where the previous bin ended. I don't think the pandas tools for binning (cut and qcut) are applicable.
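To make that sequential dependence concrete, here is a sketch over the raw NumPy array (column name Price assumed). The next bin edge is only known once the previous bin has closed, which is why fixed-edge tools like cut cannot express it. As a side effect, dropping iterrows() for a plain array loop is already far cheaper per row:

import numpy as np
import pandas as pd

def numpy_signals(df: pd.DataFrame) -> pd.DataFrame:
    prices = df["Price"].to_numpy()
    signal = np.zeros(len(prices), dtype=np.int8)
    start = prices[0]                 # reference price of the current bin
    for i in range(1, len(prices)):
        if prices[i] >= start + 1:    # bin closed by an up-move of $1+
            signal[i] = 1
            start = prices[i]         # the new bin starts here
        elif prices[i] <= start - 1:  # bin closed by a down-move of $1+
            signal[i] = -1
            start = prices[i]
    df["Signal"] = signal
    return df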
#4
I was thinking that such a CSV must come from a web scraper watching a stock price.

The thing to do is to have MySQL, or whatever you are using, write the signal column as the data arrives, not wait until you have millions of rows!
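In Python terms the idea looks like this (a sketch; the class name and interface are made up): the scraper carries the reference price along and emits the signal with each new row.

class SignalTracker:
    # Hypothetical helper: keeps the reference price so each new
    # row gets its signal the moment it arrives.
    def __init__(self, first_price: float) -> None:
        self.start = first_price          # reference price of the current bin

    def update(self, price: float) -> int:
        if price >= self.start + 1:       # moved up by at least $1
            self.start = price
            return 1
        if price <= self.start - 1:       # moved down by at least $1
            self.start = price
            return -1
        return 0

tracker = SignalTracker(100.0)
print([tracker.update(p) for p in (100.4, 101.1, 100.9, 99.8)])  # [0, 1, 0, -1]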

I made a MySQL function to write the signal column afterwards. The price column had random values greater than 99 and less than 103.

I just made a table with four columns (id, price, signal_ and start_value) and 100 rows.

start_value is the reference value that a new price must move $1 or more above or below; I included it because I wanted to keep an eye on my MySQL user variable.

Setting the columns signal_ and start_value for 100 rows in phpMyAdmin took 0.0185 seconds; scaled linearly (0.0185 s x 1,000,000 / 100), that is about 185 seconds for 1 million rows, if my maths is correct.

