Python Forum
PySpark Coding Challenge
#1
Hello Community,

I have been presented with a challenge that I'm struggling with.

The challenge is as follows:
Write three Python functions, register them as PySpark UDF functions, and use them to produce an output DataFrame.
The following is a sample of the dataset, also attached:

Output:
+----------------------------------------------+-----------------------+-----------+------------------------+
|Species                                       |Category               |Period     |Annual percentage change|
+----------------------------------------------+-----------------------+-----------+------------------------+
|Greenfinch (Chloris chloris)                  |Farmland birds         |(1970-2014)|-1.13                   |
|Siskin (Carduelis spinus)                     |Woodland birds         |(1995-2014)|2.26                    |
|European shag (Phalacrocorax artistotelis)    |Seabirds               |(1986-2014)|-2.31                   |
|Mute Swan (Cygnus olor)                       |Water and wetland birds|(1975-2014)|1.65                    |
|Collared Dove (Streptopelia decaocto)         |Other                  |(1970-2014)|5.2                     |
+----------------------------------------------+-----------------------+-----------+------------------------+
The requirement is to create the following three functions:

1. get_english_name - this function should get the Species column value and return the English name.

2. get_start_year - this function should get the Period column value and return the year (an integer) when data collection began.

3. get_trend - this function should get the Annual percentage change column value and return the change trend category based on the following rules (my reading of the boundaries is sketched just after this list):
a. Annual percentage change less than -3.00 – return 'strong decline'
b. Annual percentage change between -3.00 and -0.50 (inclusive) – return 'weak decline'
c. Annual percentage change between -0.50 and 0.50 (exclusive) – return 'no change'
d. Annual percentage change between 0.50 and 3.00 (inclusive) – return 'weak increase'
e. Annual percentage change more than 3.00 – return 'strong increase'.
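
To make sure I read the inclusive/exclusive boundaries correctly, here they are as plain Python (no Spark yet; this is just my understanding of the rules):

def get_trend(annual_percentage_change):
    # Map an annual percentage change to a trend category (rules a-e)
    x = annual_percentage_change
    if x < -3.00:
        return 'strong decline'
    elif -3.00 <= x <= -0.50:   # endpoints inclusive (rule b)
        return 'weak decline'
    elif -0.50 < x < 0.50:      # endpoints exclusive (rule c)
        return 'no change'
    elif 0.50 <= x <= 3.00:     # endpoints inclusive (rule d)
        return 'weak increase'
    else:                       # x > 3.00 (rule e)
        return 'strong increase'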

The functions then need to be registered as PySpark UDF functions so that they can be used in PySpark; my rough understanding of that step is sketched below.
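
From the PySpark docs, I believe the registration step looks roughly like this (a sketch only, assuming the three functions above are written; df stands for the input DataFrame, which I haven't shown):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType, IntegerType

# Wrap the plain Python functions as Spark UDFs with explicit return types
get_english_name_udf = udf(get_english_name, StringType())
get_start_year_udf = udf(get_start_year, IntegerType())
get_trend_udf = udf(get_trend, StringType())

# Apply them column by column to build the output DataFrame
out = (df
       .withColumn("English name", get_english_name_udf(col("Species")))
       .withColumn("Start year", get_start_year_udf(col("Period")))
       .withColumn("Trend", get_trend_udf(col("Annual percentage change"))))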

Any assistance greatly appreciated.
#2
Show us what you've done so far (Python code), and where you are having difficulty.
#3
def get_english_name(species):
    pass


def get_start_year(period):
    pass


def get_trend(annual_percentage_change):
    pass
#4
Come on, you can't seriously consider just writing the function signatures as actual effort, can you?
#5
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, FloatType
from pyspark.sql.functions import when, col, udf
spark = SparkSession.builder.appName("exp").getOrCreate()
sc = spark.sparkContext
# Return the English name: the part of Species before " (scientific name)"
@udf(returnType=StringType())
def get_english_name(val):
    return val[0:val.index(" (")]

# Return the start year parsed from a Period string such as "(1970-2014)"
@udf(returnType=IntegerType())
def get_start_year(val):
    return int(val[1:5])

# Return the trend category for an annual percentage change (rules a-e)
@udf(returnType=StringType())
def get_trend(x):
    if x < -3.00:
        return "strong decline"
    elif -3.00 <= x <= -0.50:
        return "weak decline"
    elif -0.50 < x < 0.50:
        return "no change"
    elif 0.50 <= x <= 3.00:
        return "weak increase"
    else:
        return "strong increase"
    
info = [("Greenfinch (Chloris chloris)","Farmland birds","(1970-2014)",-1.13),("Siskin (Carduelis spinus)","Woodland birds","(1995-2014)",2.26),
        ("European shag (Phalacrocorax artistotelis)","Seabirds","(1986-2014)",-2.31),("Mute Swan (Cygnus olor)","Water and wetland birds","(1975-2014)",1.65)
        ,("Collared Dove (Streptopelia decaocto)","other","(1970-2014)",5.2)] 
schema1 = StructType(
    [StructField("Species", StringType()),
     StructField("Category", StringType()),
     StructField("Period", StringType()),
     StructField("Annual_percentage_change", FloatType())
     ])

rdd = sc.parallelize(info)
data = spark.createDataFrame(rdd, schema=schema1) 

data2 = (data
         .withColumn("English_Name", get_english_name(col("Species")))
         .withColumn("start_year", get_start_year(col("Period")))
         .withColumn("Trend", get_trend(col("Annual_percentage_change"))))
data2.show()
spark.stop()
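
As an aside, I think the same result could be produced without Python UDFs at all, using only built-in column functions (a sketch, which would go before spark.stop(); split, substring and when are all standard pyspark.sql.functions):

from pyspark.sql.functions import when, col, split, substring

data3 = (data
         .withColumn("English_Name", split(col("Species"), r" \(").getItem(0))
         .withColumn("start_year", substring(col("Period"), 2, 4).cast("int"))
         .withColumn("Trend",
                     when(col("Annual_percentage_change") < -3.00, "strong decline")
                     .when(col("Annual_percentage_change") <= -0.50, "weak decline")
                     .when(col("Annual_percentage_change") < 0.50, "no change")
                     .when(col("Annual_percentage_change") <= 3.00, "weak increase")
                     .otherwise("strong increase")))
data3.show(truncate=False)
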
Gribouillis write Jun-25-2023, 03:13 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.


