Pyspark Window: perform sum over a window with specific conditions - Shena76 - Jun-13-2022

Hello everyone,

I'm just learning about pyspark windows, and I was wondering how to achieve a particular operation. I have a Pyspark table that looks like this:

ID | DATE_OPEN  | DATE_CLOSE | IS_DATE_WITH_EVENT | explanation
1  | 2022-03-05 | 2022-03-10 | 1                  | there was an event created on 03-05 for this ID, and closed on 03-10
1  | 2022-03-06 | null       | 0                  | there was no event created for this ID on 03-06
1  | 2022-03-07 | null       | 1                  | there was an event created on 03-07 for this ID, but it hasn't been closed yet
1  | 2022-03-08 | 2022-03-08 | 1                  | ...
1  | 2022-03-09 | null       | 0                  | ...
1  | 2022-03-10 | null       | 0                  | ...
1  | 2022-03-11 | 2022-03-12 | 1                  | ...
1  | 2022-03-12 | null       | 1                  | ...
1  | 2022-03-13 | 2022-03-13 | 1                  | ...
1  | 2022-03-14 | null       | 0                  | ...
1  | 2022-03-15 | null       | 0                  | ...
2  | ...        | ...        | ...                | ...
.. | ...        | ...        | ...                | ...
2  | ...        | ...        | ...                | ...
3  | ...        | ...        | ...                | ...
.. | ...        | ...        | ...                | ...

What I want to do is, for each ID and each date in DATE_OPEN, count the number of events created in the 3 days before that DATE_OPEN (the current day included) that are still open at that DATE_OPEN: those whose DATE_CLOSE is strictly greater than the DATE_OPEN in question, or whose DATE_CLOSE is null while IS_DATE_WITH_EVENT == 1. To illustrate, here is the expected result:

ID | DATE_OPEN  | DATE_CLOSE | IS_DATE_WITH_EVENT | EXPECTED_RESULT | explanation
1  | 2022-03-05 | 2022-03-10 | 1                  | 1               | event created on day D, not closed yet as of 03-05
1  | 2022-03-06 | null       | 0                  | 1               | event created on D-1, not closed yet as of 03-06
1  | 2022-03-07 | null       | 1                  | 2               | one event created on D-2 and one on D, both not closed yet as of 03-07
1  | 2022-03-08 | 2022-03-08 | 1                  | 2               | one event created on D-3 and one on D-1, both not closed yet as of 03-08
1  | 2022-03-09 | null       | 0                  | 1               | event created on D-2, not closed yet as of 03-09
1  | 2022-03-10 | null       | 0                  | 1               | event created on D-3, not closed yet as of 03-10
1  | 2022-03-11 | 2022-03-12 | 1                  | 1               | event created on D, not closed yet as of 03-11
1  | 2022-03-12 | null       | 1                  | 1               | event created on D, not closed yet as of 03-12
1  | 2022-03-13 | 2022-03-13 | 1                  | 1               | event created on D-1, not closed yet as of 03-13
1  | 2022-03-14 | null       | 0                  | 1               | event created on D-2, not closed yet as of 03-14
1  | 2022-03-15 | null       | 0                  | 1               | event created on D-3, not closed yet as of 03-15
2  | ...        | ...        | ...                | ...             | ...
.. | ...        | ...        | ...                | ...             | ...
2  | ...        | ...        | ...                | ...             | ...
3  | ...        | ...        | ...                | ...             | ...
.. | ...        | ...        | ...                | ...             | ...

I have started by defining a window this way:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda i: i * 86400  # a range window over a long column is expressed in seconds

id_window = (
    Window
    .partitionBy(F.col("ID"))
    .orderBy(F.col("DATE_OPEN").cast("timestamp").cast("long"))
    .rangeBetween(-days(3), days(0))
)

But I don't know whether this is the window I need to enrich to solve my problem (and how to do so), or whether the condition belongs directly in the expression that creates the new feature. I'm a bit stuck here ^^ I would really appreciate your help!

Thanks for taking the time to read my post :)

Cheers
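
For reference, here is a minimal runnable sketch of one way to obtain the EXPECTED_RESULT column, using a self-join instead of a window; df, events, e_id, e_open and e_close are illustrative names introduced for the sketch, not an established solution. The reason for stepping away from a pure window aggregate is that the "still open" test compares each candidate event's DATE_CLOSE against the current row's DATE_OPEN, and a condition inside a windowed sum is evaluated on the frame rows themselves, so it cannot see the current row's DATE_OPEN.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data for ID 1, taken from the table above.
rows = [
    ("1", "2022-03-05", "2022-03-10", 1),
    ("1", "2022-03-06", None, 0),
    ("1", "2022-03-07", None, 1),
    ("1", "2022-03-08", "2022-03-08", 1),
    ("1", "2022-03-09", None, 0),
    ("1", "2022-03-10", None, 0),
    ("1", "2022-03-11", "2022-03-12", 1),
    ("1", "2022-03-12", None, 1),
    ("1", "2022-03-13", "2022-03-13", 1),
    ("1", "2022-03-14", None, 0),
    ("1", "2022-03-15", None, 0),
]
df = (
    spark.createDataFrame(rows, ["ID", "DATE_OPEN", "DATE_CLOSE", "IS_DATE_WITH_EVENT"])
    .withColumn("DATE_OPEN", F.to_date("DATE_OPEN"))
    .withColumn("DATE_CLOSE", F.to_date("DATE_CLOSE"))
)

# One row per actual event creation; the e_* aliases avoid column-name clashes in the join.
events = df.filter(F.col("IS_DATE_WITH_EVENT") == 1).select(
    F.col("ID").alias("e_id"),
    F.col("DATE_OPEN").alias("e_open"),
    F.col("DATE_CLOSE").alias("e_close"),
)

result = (
    df.join(
        events,
        (F.col("ID") == F.col("e_id"))
        # event created within the 3 days before DATE_OPEN, current day included
        & (F.col("e_open") >= F.date_sub(F.col("DATE_OPEN"), 3))
        & (F.col("e_open") <= F.col("DATE_OPEN"))
        # still open as of DATE_OPEN: closed strictly later, or not closed at all
        & (F.col("e_close").isNull() | (F.col("e_close") > F.col("DATE_OPEN"))),
        "left",
    )
    .groupBy("ID", "DATE_OPEN", "DATE_CLOSE", "IS_DATE_WITH_EVENT")
    # count() ignores the nulls produced by unmatched left-join rows, giving 0 there
    .agg(F.count("e_open").alias("EXPECTED_RESULT"))
    .orderBy("ID", "DATE_OPEN")
)
result.show()

On the sample above this reproduces the expected column; note the strict > on e_close, so an event closed on day D no longer counts as open on day D, which matches the 03-08 and 03-13 rows.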