Oct-11-2021, 04:04 AM
I want my PySpark code to use parallel threads when connecting to the database while inserting into a table, but it isn't.
I have tried splitting the DF and also used the numPartitions attribute in the write call, but neither helped.
The following code works and writes to the table, but over a single database connection.
import os
import io
import findspark
import pandas as pd
import boto3
import awswrangler as wr
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "25g") \
    .appName('my-cool-app') \
    .getOrCreate()

myDF = spark.read.format('jdbc').options(
    url='jdbc:redshift://hostname.com:5439/dev',
    driver='com.amazon.redshift.jdbc42.Driver',
    dbtable='schema1.table1',
    user='awsuser',
    password='securepassword').load()

myDF.count()

myDF_part = myDF.repartition(16)

myDF_part.write.format('jdbc').options(
    url='jdbc:oracle:thin:@oraclehost:1521/iINST1',
    driver='oracle.jdbc.driver.OracleDriver',
    dbtable='test',
    batchsize=10000,
    numPartitions=16,
    user='someuser',
    password='somepassword').mode('append').save()
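One thing worth noting: a JDBC read without partitioning options comes back as a single partition read over one connection, and on the write side numPartitions only caps the partition count, it does not split the data. Passing partitionColumn, lowerBound, upperBound, and numPartitions to the read makes Spark issue parallel queries from the start, so each partition later writes over its own connection. Below is a minimal sketch of assembling those read options; the helper name, the column name `id`, and the bound values are hypothetical placeholders, not anything from the original post.

```python
# Hypothetical helper: builds the option dict for a partitioned JDBC read.
# partitionColumn/lowerBound/upperBound/numPartitions are standard Spark
# JDBC data-source options; the bounds only shape the split queries,
# they do not filter rows.
def partitioned_jdbc_options(url, driver, table, user, password,
                             partition_column, lower, upper, num_partitions):
    return {
        'url': url,
        'driver': driver,
        'dbtable': table,
        'user': user,
        'password': password,
        'partitionColumn': partition_column,  # must be numeric, date, or timestamp
        'lowerBound': str(lower),
        'upperBound': str(upper),
        'numPartitions': str(num_partitions),
    }

# Usage sketch (assumes a SparkSession `spark` and a numeric column `id`):
# myDF = spark.read.format('jdbc').options(
#     **partitioned_jdbc_options(
#         'jdbc:redshift://hostname.com:5439/dev',
#         'com.amazon.redshift.jdbc42.Driver',
#         'schema1.table1', 'awsuser', 'securepassword',
#         'id', 1, 1000000, 16)).load()
```

With the DataFrame already split into 16 partitions at read time, the subsequent write.format('jdbc') should open one connection per partition without needing the extra repartition(16).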