pyspark parallel write operation not working - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: pyspark parallel write operation not working (/thread-35229.html)
pyspark parallel write operation not working - aliyesami - Oct-11-2021

I want my pyspark code to use parallel threads when connecting to the database for an insert into a table, but it isn't doing so. I have tried splitting the DataFrame and also used the numPartitions attribute in the write call, but nothing helps. The following code works and writes to the table, but over a single database connection.

    import os
    import io
    import findspark
    import pandas as pd
    import boto3
    import awswrangler as wr
    import pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .master('local[*]') \
        .config("spark.driver.memory", "25g") \
        .appName('my-cool-app') \
        .getOrCreate()

    # Read the source table from Redshift over JDBC
    myDF = spark.read.format('jdbc').options(
        url='jdbc:redshift://hostname.com:5439/dev',
        driver='com.amazon.redshift.jdbc42.Driver',
        dbtable='schema1.table1',
        user='awsuser',
        password='securepassword').load()

    myDF.count()

    # Repartition before writing, hoping for 16 parallel connections
    myDF_part = myDF.repartition(16)

    myDF_part.write.format('jdbc').options(
        url='jdbc:oracle:thin:@oraclehost:1521/iINST1',
        driver='oracle.jdbc.driver.OracleDriver',
        dbtable='test',
        batchsize=10000,
        numPartitions=16,
        user='someuser',
        password='somepassword').mode('append').save()

RE: pyspark parallel write operation not working - aliyesami - Oct-16-2021

There must be many people writing to a database from Python. Has no one ever wanted to use more than one session to do this?
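One relevant detail, for context: Spark's JDBC reader only opens parallel connections when it is given a partitioning scheme, via the partitionColumn, lowerBound, upperBound, and numPartitions options; it then splits the column's value range into equal strides and issues one query per partition. As a rough, illustrative sketch (simplified, not Spark's actual internal code, and the column name and bounds are made-up values), the per-partition WHERE clauses look like this:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Simplified sketch of how Spark's JDBC reader splits a numeric
    column range into per-partition WHERE clauses. The first partition
    also picks up NULLs and anything below the lower bound; the last
    picks up anything at or above its stride start."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition: everything below the first boundary, plus NULLs
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition: open-ended upper range
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

# Hypothetical example: split an "id" column from 0 to 1600 into 4 partitions
for pred in jdbc_partition_predicates("id", 0, 1600, 4):
    print(pred)
```

Each predicate becomes one task, and each task gets its own database connection, which is what produces parallel sessions on the read side.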