PySpark Coding Challenge - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: PySpark Coding Challenge (/thread-32510.html)
PySpark Coding Challenge - cpatte7372 - Feb-14-2021

Hello Community,

I have been presented with a challenge that I'm struggling with. The challenge is as follows: write three Python functions, register them as PySpark UDF functions and use them to produce an output DataFrame. The following is a sample of the dataset, also attached:

The requirement is to create the following three functions:

1. get_english_name - this function should take the Species column value and return the English name.
2. get_start_year - this function should take the Period column value and return the year (an integer) when data collection began.
3. get_trend - this function should take the Annual percentage change column value and return the change trend category based on the following rules:
   a. Annual percentage change less than -3.00 – return 'strong decline'
   b. Annual percentage change between -3.00 and -0.50 (inclusive) – return 'weak decline'
   c. Annual percentage change between -0.50 and 0.50 (exclusive) – return 'no change'
   d. Annual percentage change between 0.50 and 3.00 (inclusive) – return 'weak increase'
   e. Annual percentage change more than 3.00 – return 'strong increase'

The functions then need to be registered as PySpark UDF functions so that they can be used in PySpark. Any assistance greatly appreciated.

RE: PySpark Coding Challenge - Larz60+ - Feb-14-2021

Show us what you've done so far (Python code), and where you are having difficulty.

RE: PySpark Coding Challenge - cpatte7372 - Feb-14-2021

def get_english_name(species):
    pass

def get_start_year(period):
    pass

def get_trend(annual_percentage_change):
    pass

RE: PySpark Coding Challenge - ndc85430 - Feb-14-2021

Come on, you can't seriously consider just writing the function signatures as actual effort, can you?
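The three functions can be written and tested as plain Python before any Spark registration. The sketch below assumes Species values look like "Greenfinch (Chloris chloris)" and Period values look like "(1970-2014)", matching the sample data in this thread; the function names follow the challenge spec, but the parsing approach is just one way to do it.

```python
# Plain-Python sketch of the three required functions (no Spark needed yet).
# Assumes Species is "English name (Scientific name)" and Period is "(YYYY-YYYY)".

def get_english_name(species):
    # The English name is everything before the bracketed scientific name.
    return species.split(" (")[0]

def get_start_year(period):
    # First year inside the parentheses, e.g. "(1970-2014)" -> 1970.
    return int(period.strip("()").split("-")[0])

def get_trend(annual_percentage_change):
    # Band boundaries follow the challenge rules: the weak bands are
    # inclusive at both ends, "no change" is exclusive at both ends.
    if annual_percentage_change < -3.00:
        return "strong decline"
    elif annual_percentage_change <= -0.50:
        return "weak decline"
    elif annual_percentage_change < 0.50:
        return "no change"
    elif annual_percentage_change <= 3.00:
        return "weak increase"
    else:
        return "strong increase"

print(get_english_name("Greenfinch (Chloris chloris)"))  # Greenfinch
print(get_start_year("(1970-2014)"))                     # 1970
print(get_trend(-1.13))                                  # weak decline
```

Once they behave correctly, each one can be registered for use in Spark, for example with spark.udf.register("get_trend", get_trend, StringType()) for SQL, or by wrapping them with pyspark.sql.functions.udf for DataFrame expressions.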
RE: PySpark Coding Challenge - prajwal_0078 - Jun-25-2023

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, FloatType
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.appName("exp").getOrCreate()
sc = spark.sparkContext

@udf(returnType=StringType())
def get_english_name(val):
    # Everything before the " (" that opens the scientific name.
    return val[0:val.index(" (")]

@udf(returnType=IntegerType())
def get_start_year(val):
    # "(1970-2014)" -> characters 1..4 are the start year.
    return int(val[1:5])

@udf(returnType=StringType())
def get_trend(x):
    # Weak bands are inclusive at both ends; "no change" is exclusive.
    if x < -3.00:
        return "strong decline"
    elif -3.00 <= x <= -0.50:
        return "weak decline"
    elif -0.50 < x < 0.50:
        return "no change"
    elif 0.50 <= x <= 3.00:
        return "weak increase"
    else:
        return "strong increase"

info = [("Greenfinch (Chloris chloris)", "Farmland birds", "(1970-2014)", -1.13),
        ("Siskin (Carduelis spinus)", "Woodland birds", "(1995-2014)", 2.26),
        ("European shag (Phalacrocorax aristotelis)", "Seabirds", "(1986-2014)", -2.31),
        ("Mute Swan (Cygnus olor)", "Water and wetland birds", "(1975-2014)", 1.65),
        ("Collared Dove (Streptopelia decaocto)", "other", "(1970-2014)", 5.2)]

schema1 = StructType([StructField("Species", StringType()),
                      StructField("Category", StringType()),
                      StructField("Period", StringType()),
                      StructField("Annual_percentage_change", FloatType())])

rdd = sc.parallelize(info)
data = spark.createDataFrame(rdd, schema=schema1)

data2 = data.withColumn("English_Name", get_english_name(col("Species"))) \
    .withColumn("start_year", get_start_year(col("Period"))) \
    .withColumn("Trend", get_trend(col("Annual_percentage_change")))

data2.show()
spark.stop()