Python Forum
PySpark Coding Challenge
#1
Hello Community,

I have been presented with a challenge that I'm struggling with.

The challenge is as follows:
Write three Python functions, register them as PySpark UDF functions, and use them to produce an output DataFrame.
The following is a sample of the dataset, also attached:

Output:
+----------------------------------------------+-----------------------+-----------+------------------------+
|Species                                       |Category               |Period     |Annual percentage change|
+----------------------------------------------+-----------------------+-----------+------------------------+
|Greenfinch (Chloris chloris)                  |Farmland birds         |(1970-2014)|-1.13                   |
|Siskin (Carduelis spinus)                     |Woodland birds         |(1995-2014)|2.26                    |
|European shag (Phalacrocorax artistotelis)    |Seabirds               |(1986-2014)|-2.31                   |
|Mute Swan (Cygnus olor)                       |Water and wetland birds|(1975-2014)|1.65                    |
|Collared Dove (Streptopelia decaocto)         |Other                  |(1970-2014)|5.2                     |
+----------------------------------------------+-----------------------+-----------+------------------------+
The requirement is to create the following three functions:

1. get_english_name - this function should get the Species column value and return the English name.

2. get_start_year - this function should get the Period column value and return the year (an integer) when data collection began.

3. get_trend - this function should get the Annual percentage change column value and return the change trend category based on the following rules:
a. Annual percentage change less than -3.00 – return 'strong decline'
b. Annual percentage change between -3.00 and -0.50 (inclusive) – return 'weak decline'
c. Annual percentage change between -0.50 and 0.50 (exclusive) – return 'no change'
d. Annual percentage change between 0.50 and 3.00 (inclusive) – return 'weak increase'
e. Annual percentage change more than 3.00 – return 'strong increase'.

The functions then need to be registered as PySpark UDF functions so that they can be used in PySpark.

Any assistance greatly appreciated.
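A minimal sketch of the three functions plus UDF registration. The parsing logic assumes the column formats shown in the sample output ("English name (Latin name)" for Species and "(1970-2014)" for Period); the helper name `register_udfs` is illustrative, not part of any API.

```python
import re


def get_english_name(species):
    # "Greenfinch (Chloris chloris)" -> "Greenfinch":
    # the English name is everything before the parenthesised Latin name.
    return species.split("(")[0].strip()


def get_start_year(period):
    # "(1970-2014)" -> 1970: the start year is the first 4-digit run.
    return int(re.search(r"\d{4}", period).group())


def get_trend(change):
    # Map the annual percentage change onto the five trend categories,
    # following the inclusive/exclusive boundaries in the rules above.
    change = float(change)
    if change < -3.00:
        return "strong decline"
    elif change <= -0.50:
        return "weak decline"
    elif change < 0.50:
        return "no change"
    elif change <= 3.00:
        return "weak increase"
    return "strong increase"


def register_udfs(spark):
    # Register the plain Python functions as PySpark UDFs so they can be
    # used from Spark SQL and, via selectExpr, the DataFrame API.
    # Requires pyspark and an active SparkSession.
    from pyspark.sql.types import IntegerType, StringType

    spark.udf.register("get_english_name", get_english_name, StringType())
    spark.udf.register("get_start_year", get_start_year, IntegerType())
    spark.udf.register("get_trend", get_trend, StringType())
```

After registration you could build the output with something like `df.selectExpr("get_english_name(Species) AS English_name", "Category", "get_start_year(Period) AS Start_year", "get_trend(`Annual percentage change`) AS Trend")` - untested here, and column names depend on how the attached dataset is loaded.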
Messages In This Thread
PySpark Coding Challenge - by cpatte7372 - Feb-14-2021, 01:07 PM
