Python Forum
PySpark Coding Challenge
Hello Community,

I have been presented with a challenge that I'm struggling with.

The challenge is as follows:
Write three Python functions, register them as PySpark UDF functions, and use them to produce an output dataframe.
The following is a sample of the dataset, also attached:

Output:
+------------------------------------------+-----------------------+-----------+------------------------+
|Species                                   |Category               |Period     |Annual percentage change|
+------------------------------------------+-----------------------+-----------+------------------------+
|Greenfinch (Chloris chloris)              |Farmland birds         |(1970-2014)|-1.13                   |
|Siskin (Carduelis spinus)                 |Woodland birds         |(1995-2014)|2.26                    |
|European shag (Phalacrocorax artistotelis)|Seabirds               |(1986-2014)|-2.31                   |
|Mute Swan (Cygnus olor)                   |Water and wetland birds|(1975-2014)|1.65                    |
|Collared Dove (Streptopelia decaocto)     |Other                  |(1970-2014)|5.2                     |
+------------------------------------------+-----------------------+-----------+------------------------+
The requirement is to create the following three functions:

1. get_english_name - this function should get the Species column value and return the English name.

2. get_start_year - this function should get the Period column value and return the year (an integer) when data collection began.

3. get_trend - this function should get the Annual percentage change column value and return the change trend category based on the following rules:
a. Annual percentage change less than -3.00 – return 'strong decline'
b. Annual percentage change between -3.00 and -0.50 (inclusive) – return 'weak decline'
c. Annual percentage change between -0.50 and 0.50 (exclusive) – return 'no change'
d. Annual percentage change between 0.50 and 3.00 (inclusive) – return 'weak increase'
e. Annual percentage change more than 3.00 – return 'strong increase'.
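As a starting point, here is a minimal sketch of the three functions in plain Python, assuming the Species values look like "Greenfinch (Chloris chloris)" and the Period values like "(1970-2014)", as in the sample above:

```python
import re

def get_english_name(species):
    # "Greenfinch (Chloris chloris)" -> "Greenfinch"
    # (everything before the parenthesised scientific name)
    return species.split("(")[0].strip()

def get_start_year(period):
    # "(1970-2014)" -> 1970 (first four-digit number in the string)
    return int(re.search(r"\d{4}", period).group())

def get_trend(change):
    # Map the annual percentage change onto the five trend categories,
    # matching the inclusive/exclusive boundaries in rules a-e.
    if change < -3.00:
        return "strong decline"
    if change <= -0.50:
        return "weak decline"
    if change < 0.50:
        return "no change"
    if change <= 3.00:
        return "weak increase"
    return "strong increase"
```

For example, `get_trend(-1.13)` returns `'weak decline'` and `get_trend(5.2)` returns `'strong increase'`, matching the sample rows.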

The functions then need to be registered as PySpark UDF functions so that they can be used in PySpark.
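One way the registration step could look, hedged as a sketch: assuming an existing SparkSession and the dataset in a CSV file with the column names from the sample (the path argument is a placeholder for the attached file), the plain functions can be wrapped with `pyspark.sql.functions.udf` and applied via `withColumn`:

```python
import re

# Plain-Python helpers implementing the three required functions.
def get_english_name(species):
    return species.split("(")[0].strip()

def get_start_year(period):
    return int(re.search(r"\d{4}", period).group())

def get_trend(change):
    if change < -3.00:
        return "strong decline"
    if change <= -0.50:
        return "weak decline"
    if change < 0.50:
        return "no change"
    if change <= 3.00:
        return "weak increase"
    return "strong increase"

def build_output(spark, path):
    # pyspark imports are kept inside the function so the helpers above
    # remain importable and testable without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType, StringType

    df = spark.read.csv(path, header=True, inferSchema=True)

    # Wrap the plain functions as UDFs with explicit return types.
    english_name = udf(get_english_name, StringType())
    start_year = udf(get_start_year, IntegerType())
    trend = udf(get_trend, StringType())

    return (df
            .withColumn("English name", english_name(col("Species")))
            .withColumn("Start year", start_year(col("Period")))
            .withColumn("Trend", trend(col("Annual percentage change"))))
```

Usage would then be along the lines of `build_output(spark, "birds.csv").show(truncate=False)`. If the UDFs are needed from Spark SQL rather than the DataFrame API, `spark.udf.register("get_trend", get_trend, StringType())` (and likewise for the other two) registers them by name.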

Any assistance greatly appreciated.


Messages In This Thread
PySpark Coding Challenge - by cpatte7372 - Feb-14-2021, 01:07 PM