Python Forum
PySpark Coding Challenge
#5
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, FloatType
from pyspark.sql.functions import col, udf
spark = SparkSession.builder.appName("exp").getOrCreate()
sc = spark.sparkContext
@udf(returnType=StringType())
def get_english_name(val):
    return val[0:val.index(" (")]

@udf(returnType=IntegerType())
def get_start_year(val):
    return int(val[1:5])

@udf(returnType=StringType())
def get_trend(x):
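    # Each branch is an upper bound, so values exactly on -3.00 or -0.50
    # land in the milder category instead of falling through to the else.
    if x < -3.00:
        return "strong decline"
    elif x < -0.50:
        return "weak decline"
    elif x < 0.50:
        return "no change"
    else:
        return "strong increase"
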
# Sample rows: (species, category, period, annual percentage change)
info = [
    ("Greenfinch (Chloris chloris)", "Farmland birds", "(1970-2014)", -1.13),
    ("Siskin (Carduelis spinus)", "Woodland birds", "(1995-2014)", 2.26),
    ("European shag (Phalacrocorax artistotelis)", "Seabirds", "(1986-2014)", -2.31),
    ("Mute Swan (Cygnus olor)", "Water and wetland birds", "(1975-2014)", 1.65),
    ("Collared Dove (Streptopelia decaocto)", "other", "(1970-2014)", 5.2),
]
schema1 = StructType(
    [StructField("Species", StringType()),
     StructField("Category", StringType()),
     StructField("Period", StringType()),
     StructField("Annual_percentage_change", FloatType())
     ])

rdd = sc.parallelize(info)
data = spark.createDataFrame(rdd, schema=schema1) 

# Derive the three new columns by applying the UDFs defined above.
data2 = data.withColumn("English_Name", get_english_name(col("Species")))\
    .withColumn("start_year", get_start_year(col("Period")))\
    .withColumn("Trend", get_trend(col("Annual_percentage_change")))
data2.show()
spark.stop()
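The three UDFs above contain ordinary string and number logic, so it can help to check that logic without starting a Spark session. Below is a minimal sketch of the same rules as plain Python functions. The function names (english_name, start_year, trend) are hypothetical and not part of the original post, and putting boundary values such as -3.0 and -0.5 in the milder category is an assumption, since the post does not say where they belong.

```python
def english_name(species: str) -> str:
    """Return the text before ' (' in a string like 'English (Latin)'."""
    return species.split(" (", 1)[0]


def start_year(period: str) -> int:
    """Return the first year of a period written as '(YYYY-YYYY)'."""
    return int(period.strip("()").split("-", 1)[0])


def trend(change: float) -> str:
    """Classify an annual percentage change with the post's labels.

    Assumption: a value exactly on a threshold goes to the milder category.
    """
    if change < -3.0:
        return "strong decline"
    if change < -0.5:
        return "weak decline"
    if change < 0.5:
        return "no change"
    return "strong increase"


if __name__ == "__main__":
    # Quick manual check against two of the sample rows from the post.
    print(english_name("Greenfinch (Chloris chloris)"), start_year("(1970-2014)"), trend(-1.13))
    print(english_name("Collared Dove (Streptopelia decaocto)"), start_year("(1970-2014)"), trend(5.2))
```

If these behave as expected, they could be wrapped for Spark with udf(english_name, StringType()) and so on, which keeps the Spark-specific part of the script thin.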
Gribouillis wrote Jun-25-2023, 03:13 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Messages In This Thread
PySpark Coding Challenge - by cpatte7372 - Feb-14-2021, 01:07 PM
RE: PySpark Coding Challenge - by Larz60+ - Feb-14-2021, 01:52 PM
RE: PySpark Coding Challenge - by cpatte7372 - Feb-14-2021, 01:54 PM
RE: PySpark Coding Challenge - by ndc85430 - Feb-14-2021, 04:49 PM
RE: PySpark Coding Challenge - by prajwal_0078 - Jun-25-2023, 12:56 PM