Jun-13-2022, 08:59 AM
Hello everyone,
I'm just learning about pyspark windows, and I was wondering how to achieve a particular operation.
I have a Pyspark table that looks like this:
ID | DATE_OPEN  | DATE_CLOSE | IS_DATE_WITH_EVENT | explanation
1  | 2022-03-05 | 2022-03-10 | 1                  | there was an event created on 03-05 for this ID, and closed on 03-10
1  | 2022-03-06 | null       | 0                  | there was no event created for this ID on 03-06
1  | 2022-03-07 | null       | 1                  | there was an event created on 03-07 for this ID, but it hasn't been closed yet
1  | 2022-03-08 | 2022-03-08 | 1                  | ...
1  | 2022-03-09 | null       | 0                  | ...
1  | 2022-03-10 | null       | 0                  | ...
1  | 2022-03-11 | 2022-03-12 | 1                  | ...
1  | 2022-03-12 | null       | 1                  | ...
1  | 2022-03-13 | 2022-03-13 | 1                  | ...
1  | 2022-03-14 | null       | 0                  | ...
1  | 2022-03-15 | null       | 0                  | ...
2  | ...        | ...        | ...                | ...
.. | ...        | ...        | ...                | ...
2  | ...        | ...        | ...                | ...
3  | ...        | ...        | ...                | ...
.. | ...        | ...        | ...                | ...
What I want to do is, for each ID and each date in DATE_OPEN, count the events created in the 3 days before that DATE_OPEN (or on that day itself) that are still open at that DATE_OPEN: those with a DATE_CLOSE strictly greater than the DATE_OPEN in question, or a null DATE_CLOSE while IS_DATE_WITH_EVENT == 1.
To illustrate, here is the expected result:
ID | DATE_OPEN  | DATE_CLOSE | IS_DATE_WITH_EVENT | EXPECTED_RESULT | explanation
1  | 2022-03-05 | 2022-03-10 | 1                  | 1               | event created on day D, not closed yet as of 03-05
1  | 2022-03-06 | null       | 0                  | 1               | event created on D-1, not closed yet as of 03-06
1  | 2022-03-07 | null       | 1                  | 2               | event created on D-2, and one on D, both not closed yet as of 03-07
1  | 2022-03-08 | 2022-03-08 | 1                  | 2               | event created on D-3, and one on D-1, both not closed yet as of 03-08
1  | 2022-03-09 | null       | 0                  | 1               | event created on D-2, not closed yet as of 03-09
1  | 2022-03-10 | null       | 0                  | 1               | event created on D-3, not closed yet as of 03-10
1  | 2022-03-11 | 2022-03-12 | 1                  | 1               | event created on D, not closed yet as of 03-11
1  | 2022-03-12 | null       | 1                  | 1               | event created on D, not closed yet as of 03-12
1  | 2022-03-13 | 2022-03-13 | 1                  | 1               | event created on D-1, not closed yet as of 03-13
1  | 2022-03-14 | null       | 0                  | 1               | event created on D-2, not closed yet as of 03-14
1  | 2022-03-15 | null       | 0                  | 1               | event created on D-3, not closed yet as of 03-15
2  | ...        | ...        | ...                | ...             | ...
.. | ...        | ...        | ...                | ...             | ...
2  | ...        | ...        | ...                | ...             | ...
3  | ...        | ...        | ...                | ...             | ...
.. | ...        | ...        | ...                | ...             | ...
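To pin down the rule exactly, here is a plain-Python sketch that reproduces the EXPECTED_RESULT column for ID 1 (dates copied from the rows above; `open_events` is just an illustrative helper, not part of my actual pipeline):

```python
from datetime import date, timedelta

# rows for ID 1 from the table above: (DATE_OPEN, DATE_CLOSE, IS_DATE_WITH_EVENT)
rows = [
    (date(2022, 3, 5),  date(2022, 3, 10), 1),
    (date(2022, 3, 6),  None,              0),
    (date(2022, 3, 7),  None,              1),
    (date(2022, 3, 8),  date(2022, 3, 8),  1),
    (date(2022, 3, 9),  None,              0),
    (date(2022, 3, 10), None,              0),
    (date(2022, 3, 11), date(2022, 3, 12), 1),
    (date(2022, 3, 12), None,              1),
    (date(2022, 3, 13), date(2022, 3, 13), 1),
    (date(2022, 3, 14), None,              0),
    (date(2022, 3, 15), None,              0),
]

def open_events(rows, d):
    """Count events opened in [d - 3 days, d] that are still open at d."""
    return sum(
        1
        for opened, closed, is_event in rows
        if is_event == 1
        and d - timedelta(days=3) <= opened <= d
        and (closed is None or closed > d)
    )

counts = [open_events(rows, r[0]) for r in rows]
print(counts)  # [1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1]
```

Note that an event closed on the same day it was opened (e.g. 03-08) does not count as "still open" on that day, which is why the check on DATE_CLOSE is strict.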
I have started with defining a window this way:
    days = lambda i: i * 86400
    id_window = (
        Window()
        .partitionBy(F.col("ID"))
        .orderBy(F.col("DATE_OPEN").cast("timestamp").cast("long"))
        .rangeBetween(-days(3), days(0))
    )

But I don't know whether this is the window I need to enrich to solve my problem (and how to do so), or whether the logic belongs directly in the definition of the new feature. I'm a bit stuck here ^^
I would really appreciate your help! Thanks for having taken the time to read my post :)
Cheers