Python Forum
How to iterate Groupby in Python/PySpark
#1
I have to create several summary tables; two are included below for simplicity. I'm trying to minimize the code rather than type out 10 of these blocks. Is there a straightforward way to iterate over the groupBy variables? The first run uses col1/col2, the second uses col3/col4, and so on.

I'm somewhat new to Python, so I'd appreciate any advice!

NEED1= HAVE.groupBy('col1',"col2")\
                      .agg(F.sum('col5').alias('col5'), \
                             F.sum('col6').alias('col6'), \
                             F.sum('col7').alias('col7'), \
                             F.sum('col8').alias('col8')) \
                      .sort('col1','col2')
NEED2= HAVE.groupBy('col3',"col4")\
                      .agg(F.sum('col5').alias('col5'), \
                             F.sum('col6').alias('col6'), \
                             F.sum('col7').alias('col7'), \
                             F.sum('col8').alias('col8')) \
                      .sort('col3','col4')

NEED1.show()
NEED2.show()
#2
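Something like:
from pyspark.sql import functions as F

def create_group(colx, coly):
    # Group HAVE by the two given columns, sum col5-col8, and sort by the same columns
    return HAVE.groupBy(colx, coly) \
        .agg(F.sum('col5').alias('col5'), \
        F.sum('col6').alias('col6'), \
        F.sum('col7').alias('col7'), \
        F.sum('col8').alias('col8')) \
        .sort(colx, coly)

def groups():
    # Column names are passed as strings
    NEED1 = create_group('col1', 'col2')
    NEED2 = create_group('col3', 'col4')
    return NEED1, NEED2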
#3
This worked perfectly, thank you!