Pyspark Window: perform sum over a window with specific conditions - Shena76 - Jun-13-2022

Hello everyone,

I'm just learning about PySpark window functions, and I was wondering how to achieve a particular operation.

I have a PySpark DataFrame that looks like this:

ID | DATE_OPEN  | DATE_CLOSE | IS_DATE_WITH_EVENT | explanation
1  | 2022-03-05 | 2022-03-10 | 1                  | there was an event created on 03-05 for this ID, and closed on 03-10
1  | 2022-03-06 | null       | 0                  | there was no event created for this ID on 03-06
1  | 2022-03-07 | null       | 1                  | there was an event created on 03-07 for this ID, but it hasn't been closed yet
1  | 2022-03-08 | 2022-03-08 | 1                  | ...
1  | 2022-03-09 | null       | 0                  | ...
1  | 2022-03-10 | null       | 0                  | ...
1  | 2022-03-11 | 2022-03-12 | 1                  | ...
1  | 2022-03-12 | null       | 1                  | ...
1  | 2022-03-13 | 2022-03-13 | 1                  | ...
1  | 2022-03-14 | null       | 0                  | ...
1  | 2022-03-15 | null       | 0                  | ...
2  | ...        | ...        | ...                | ...
.. | ...        | ...        | ...                | ...
2  | ...        | ...        | ...                | ...
3  | ...        | ...        | ...                | ...
.. | ...        | ...        | ...                | ...


What I want to do is, for each ID and each date in DATE_OPEN, count the number of events created in the 3 days up to and including that DATE_OPEN that are still open on that DATE_OPEN (i.e. events whose DATE_CLOSE is greater than the DATE_OPEN in question, or whose DATE_CLOSE is null while IS_DATE_WITH_EVENT == 1).

To illustrate, here is the expected result:

ID | DATE_OPEN  | DATE_CLOSE | IS_DATE_WITH_EVENT | EXPECTED_RESULT | explanation
1  | 2022-03-05 | 2022-03-10 | 1                  | 1               | event created on day D, not closed yet as of 03-05
1  | 2022-03-06 | null       | 0                  | 1               | event created on D-1, not closed yet as of 03-06
1  | 2022-03-07 | null       | 1                  | 2               | event created on D-2, and one on D, both not closed yet as of 03-07
1  | 2022-03-08 | 2022-03-08 | 1                  | 2               | event created on D-3, and one on D-1, both not closed yet as of 03-08
1  | 2022-03-09 | null       | 0                  | 1               | event created on D-2, not closed yet as of 03-09
1  | 2022-03-10 | null       | 0                  | 1               | event created on D-3, not closed yet as of 03-10
1  | 2022-03-11 | 2022-03-12 | 1                  | 1               | event created on D, not closed yet as of 03-11
1  | 2022-03-12 | null       | 1                  | 1               | event created on D, not closed yet as of 03-12
1  | 2022-03-13 | 2022-03-13 | 1                  | 1               | event created on D-1, not closed yet as of 03-13
1  | 2022-03-14 | null       | 0                  | 1               | event created on D-2, not closed yet as of 03-14
1  | 2022-03-15 | null       | 0                  | 1               | event created on D-3, not closed yet as of 03-15
2  | ...        | ...        | ...                | ...             | ...
.. | ...        | ...        | ...                | ...             | ...
2  | ...        | ...        | ...                | ...             | ...
3  | ...        | ...        | ...                | ...             | ...
.. | ...        | ...        | ...                | ...             | ...
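
In case anyone wants to reproduce this, here is how the rows for ID 1 can be built (this assumes an already-running SparkSession named spark; dates start as strings and are cast to real dates):

from pyspark.sql import functions as F

rows = [
    (1, "2022-03-05", "2022-03-10", 1),
    (1, "2022-03-06", None, 0),
    (1, "2022-03-07", None, 1),
    (1, "2022-03-08", "2022-03-08", 1),
    (1, "2022-03-09", None, 0),
    (1, "2022-03-10", None, 0),
    (1, "2022-03-11", "2022-03-12", 1),
    (1, "2022-03-12", None, 1),
    (1, "2022-03-13", "2022-03-13", 1),
    (1, "2022-03-14", None, 0),
    (1, "2022-03-15", None, 0),
]
df = (
    spark.createDataFrame(rows, ["ID", "DATE_OPEN", "DATE_CLOSE", "IS_DATE_WITH_EVENT"])
    .withColumn("DATE_OPEN", F.to_date("DATE_OPEN"))    # string -> date
    .withColumn("DATE_CLOSE", F.to_date("DATE_CLOSE"))  # null stays null
)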


I have started by defining a window this way:
from pyspark.sql import Window
from pyspark.sql import functions as F

# rangeBetween operates on the long (seconds) value produced by orderBy,
# so express the day offsets in seconds
days = lambda i: i * 86400

id_window = (
    Window
    .partitionBy(F.col("ID"))
    .orderBy(F.col("DATE_OPEN").cast("timestamp").cast("long"))
    .rangeBetween(-days(3), days(0))  # from 3 days before the current row's DATE_OPEN up to it
)
But I don't know whether this is the window I need to build on to solve my problem (and if so, how), or whether the logic belongs directly in the expression that creates the new column. I'm a bit stuck here ^^
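
The closest I've come is rewriting the logic as a self-join instead of a window (a rough sketch, assuming df is the DataFrame built above, with real date columns). What blocks me with the window is that the condition compares each event's DATE_CLOSE against the current row's DATE_OPEN, and I don't know how to reference the current row from inside a windowed aggregate:

from pyspark.sql import functions as F

a, b = df.alias("a"), df.alias("b")

# rows "b" whose event was opened in [a.DATE_OPEN - 3 days, a.DATE_OPEN] ...
opened_in_window = (
    (F.col("b.DATE_OPEN") >= F.date_sub(F.col("a.DATE_OPEN"), 3))
    & (F.col("b.DATE_OPEN") <= F.col("a.DATE_OPEN"))
)
# ... and whose event is still open on a.DATE_OPEN
still_open = (
    (F.col("b.IS_DATE_WITH_EVENT") == 1)
    & (F.col("b.DATE_CLOSE").isNull() | (F.col("b.DATE_CLOSE") > F.col("a.DATE_OPEN")))
)

result = (
    a.join(b, (F.col("a.ID") == F.col("b.ID")) & opened_in_window & still_open, "left")
    .groupBy("a.ID", "a.DATE_OPEN", "a.DATE_CLOSE", "a.IS_DATE_WITH_EVENT")
    .agg(F.count(F.col("b.ID")).alias("EXPECTED_RESULT"))  # count() skips nulls, so unmatched rows give 0
)

As far as I can tell by hand, this reproduces the EXPECTED_RESULT column above, but it feels much heavier than a window, hence my question.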

I would really appreciate your help! Thanks for taking the time to read my post :)

Cheers