Integration of Apache Spark and Kafka on Eclipse PySpark

aupres · Feb-27-2021, 06:53 AM

This is my development environment for integrating Kafka and Spark:

IDE : Eclipse 2020-12
Python : Anaconda 2020.02 (Python 3.7)
Kafka : 2.13-2.7.0
Spark : 3.0.1-bin-hadoop3.2

I set up Eclipse by following the reference site here. Simple PySpark code runs without errors, but integrating Kafka with Spark Structured Streaming raises errors. This is the code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("appName").getOrCreate()
df = spark.read.format("kafka")\
            .option("kafka.bootstrap.servers", "localhost:9092")\
            .option("subscribe", "topicForMongoDB")\
            .option("startingOffsets", "earliest")\
            .load()\
            .selectExpr("CAST(value AS STRING) as column")
df.printSchema()
df.show()

The error thrown is:

Error:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;

So I added Python code to pull in the related JAR files:

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.0,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.1.0'

But this time a different error occurs:

Error:
Error: Missing application resource.

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.

I am stuck here. Something is wrong with my Eclipse configuration or my PySpark code, but I cannot tell what causes the errors. Could someone explain how to configure the Kafka and PySpark integration? Any reply is welcome.
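
For anyone hitting the same two errors, here is a minimal sketch of the usual fix, not a confirmed solution for this exact setup. It rests on two assumptions that the post does not state. First, that "Missing application resource" appears because PYSPARK_SUBMIT_ARGS lacks the trailing pyspark-shell token that PySpark expects when it launches the JVM from a plain Python process. Second, that the connector version should match the installed Spark (3.0.1, Scala 2.12) rather than 3.1.0. The helper names build_submit_args and read_kafka_topic are hypothetical; the topic and broker address are taken from the code above.

```python
import os

# Assumption: the connector version matches the installed Spark (3.0.1, Scala 2.12).
KAFKA_PACKAGE = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1"


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper).

    The value ends with 'pyspark-shell', which stands in for the
    application resource that spark-submit otherwise reports as missing.
    """
    return "--packages {} pyspark-shell".format(",".join(packages))


# Set this before the first SparkSession is created: PySpark reads the
# variable when it launches the JVM, so later changes have no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([KAFKA_PACKAGE])


def read_kafka_topic(topic="topicForMongoDB", servers="localhost:9092"):
    """Batch-read a Kafka topic, mirroring the snippet in the question."""
    # Lazy import so this module can be loaded even where Spark is absent.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("appName").getOrCreate()
    return (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
    )
```

Calling read_kafka_topic().show() would then correspond to the df.show() step in the question. The first run needs network access so that the package can be resolved from Maven Central. The second package in the original line, spark-streaming-kafka-0-10, targets the older DStream API and is probably not needed for format("kafka").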

Serafim

Removed, no new info.
