Python Code Review and Challenge
#1
Hello Community,

I have been given a challenge to review the following PySpark code and comment on any areas where I see errors.

While I understand most of the code, there are a few areas that I would like clarified. For example, there is a requirement to:

# Column F: Need to apply conversion factor of 2.5 i.e. Value 2, conversion factor 2.5 = 5
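If I'm reading that requirement correctly, it just means multiplying column [F] by the factor of 2.5, which I think could be done with something like this (a minimal sketch, assuming pyspark.sql.functions is available):

from pyspark.sql.functions import col

# Apply the conversion factor of 2.5 to column F, e.g. a value of 2 becomes 5
very_large_dataframe = very_large_dataframe.withColumn('F', col('F') * 2.5)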
Also there is the following requirement:

# Join very_large_dataframe to small_product_dimension_dataframe on column [B]
# Only join records to small_product_dimension_dataframe where O is greater than 10
# Keep only Column [P]
The first part of the requirement is achievable using the following code:

very_large_dataframe = very_large_dataframe.join(small_product_dimension_dataframe,
                                                        (very_large_dataframe.B == small_product_dimension_dataframe.B))
However, I'm not sure how to achieve the remaining two requirements (the filter on [O] and keeping only column [P]). The full code is below; any thoughts are greatly appreciated.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

MIN_SUM_THRESHOLD = 10_000_000


def has_columns(data_frame, column_name_list):
    # data_frame.columns is a plain Python list, so use the `in` operator
    for column_name in column_name_list:
        if column_name not in data_frame.columns:
            raise Exception('Column is missing: ' + column_name)


def column_count(data_frame):
    # data_frame.columns is a plain Python list, so use len()
    return len(data_frame.columns)


def process():
    # Create spark session
    spark = SparkSession.builder.getOrCreate()

    # very_large_dataframe
    # 250 GB of CSV files from client which must have only 10 columns [A, B, C, D, E, F, G, H, I, J]
    # [A, B] contains string data
    # [C, D, E, F, G, H, I, J] contains decimals with precision 5, scale 2 (i.e. 125.75)
    # [A, B, C, D, E] should not be null
    # [F, G, H, I, J] may be null

    very_large_dataset_location = '/Sourced/location_1'
    very_large_dataframe = spark.read.csv(very_large_dataset_location, header=True, sep="\t")

    # validate column count
    if column_count(very_large_dataframe) != 10:
        raise Exception('Incorrect column count: ' + str(column_count(very_large_dataframe)))

    # validate that dataframe has all required columns
    has_columns(very_large_dataframe, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
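    # The CSV columns are read as strings when no explicit schema is supplied; the cast
    # below is only a possible sketch (not part of the original code) for applying the
    # documented decimal(5, 2) type to columns [C] through [J]
    for decimal_column in ['C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']:
        very_large_dataframe = very_large_dataframe.withColumn(
            decimal_column, very_large_dataframe[decimal_column].cast('decimal(5,2)'))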

    # TODO
    # Column F: Need to apply conversion factor of 2.5 i.e. Value 2, conversion factor 2.5 = 5

    # Remove duplicates in column [A]
    # 50% of records in column [A] could potentially be duplicates
    very_large_dataframe = very_large_dataframe.dropDuplicates(['A'])

    # Get the sum of column F and ensure it is above MIN_SUM_THRESHOLD
    total_sum_of_column_F = very_large_dataframe.agg(spark_sum('F')).collect()[0][0]

    if total_sum_of_column_F < MIN_SUM_THRESHOLD:
        raise Exception('total_sum_of_column_F: ' + str(total_sum_of_column_F)
                        + ' is below threshold: ' + str(MIN_SUM_THRESHOLD))

    # small_geography_dimension_dataframe
    # 25 MB of parquet, 4 columns [A, K, L, M]
    # Columns [A, K, L] contain only string data
    # Column [M] is an integer
    # Columns [A, K, L, M] contain all non nullable data. Assume this is the case
    small_geography_dimension_dataset = '/location_2'
    small_geography_dimension_dataframe = spark.read.parquet(small_geography_dimension_dataset)

    # Join very_large_dataframe to small_geography_dimension_dataframe on column [A]
    # Include only column [M] from small_geography_dimension_dataframe on new very_large_dataframe
    # No data (row count) loss should occur from very_large_dataframe

    # Left join so no rows are lost from very_large_dataframe, taking only [A] and [M]
    # from the dimension table
    very_large_dataframe = very_large_dataframe.join(
        small_geography_dimension_dataframe.select('A', 'M'), on='A', how='left')

    # small_product_dimension_dataframe
    # 50 MB of parquet, 4 columns [B, N, O, P]
    # Columns [B, N] contain only string data
    # Columns [O, P] contain only integers
    # Columns [B, N, O, P] contain all non nullable data. Assume this is the case
    small_product_dimension_dataset = './location_3'  # 50 MB of parquet
    small_product_dimension_dataframe = spark.read.parquet(small_product_dimension_dataset)

    # TODO
    # Join very_large_dataframe to small_product_dimension_dataframe on column [B]
    # Only join records to small_product_dimension_dataframe where O is greater than 10
    # Keep only Column [P]

    very_large_dataframe = very_large_dataframe.join(small_product_dimension_dataframe,
                                                        (very_large_dataframe.B == small_product_dimension_dataframe.B))
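For the remaining two requirements, my current thinking is to filter the dimension table before joining and to carry over only the join key [B] and column [P]; something along these lines (a rough sketch, with the broadcast hint being my own assumption since the table is only 50 MB):

from pyspark.sql.functions import broadcast

# Only join records where O is greater than 10, keeping just the join key [B]
# and the column to carry over [P]
filtered_product_dimension = (small_product_dimension_dataframe
                              .filter(small_product_dimension_dataframe.O > 10)
                              .select('B', 'P'))

# Joining on the column name avoids a duplicate [B] column in the result;
# broadcasting the small table avoids shuffling the large one
very_large_dataframe = very_large_dataframe.join(broadcast(filtered_product_dimension), on='B')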
#2
Are you reviewing someone else's code, or are you required to write that code?
I moved the thread to the Homework section of the forum.

#3
(Jan-14-2022, 09:43 AM)buran Wrote: Are you reviewing someone else's code, or are you required to write that code?
I moved the thread to the Homework section of the forum.
Thanks for reaching out. I need to review the code, but I'm also required to write it.

