12-20-2024 10:35 PM
df_CorpBond = spark.read.format("parquet").option("header", "true").load(f"/mnt/{container_name}/raw_data/dsl.corporate.parquet")
df_CorpBond.repartition(20).write \
    .format("jdbc") \
    .option("url", url_connector) \
    .option("dbtable", "MarkIt_CorpBonds") \
    .option("user", user) \
    .option("password", pwd) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("numPartitions", 100) \
    .option("batchsize", 100000) \
    .mode("overwrite") \
    .save()
This is my code to load 2.3 GB of blob data into a SQL Server (SSMS) table. The job takes more than 2 hours. My cluster has 94 GB of memory, with 1 driver node and 2 worker nodes. How can we optimize the code?
12-21-2024 06:16 AM
Are you comparing this against other runs of the same job, or against a run on an all-purpose cluster?
12-21-2024 06:41 AM
What I'm trying to say is that my cluster has enough capacity (94 GB), so it should easily handle 2.3 GB of data, but my job is still taking a long time.
Jobs 2 and 3 completed within 3 minutes,
but job 4 takes much longer to complete its tasks.
12-22-2024 10:03 PM
Hi @vijaypodili, I'm wondering why you repartitioned the DataFrame to 20 and then set numPartitions to 100. Also, I see your cluster has 94 GB of memory, but how many cores does it have?
12-23-2024 06:04 AM
Hi @aayrm5
I just want to decrease the job time, which is why I was using repartition, batch size and numPartitions, but it did not work. Can you please suggest the correct code? Here are my worker node details:
12-23-2024 11:11 PM
Hi @vijaypodili
Ideally, given you have 8 cores (across 2 workers), repartition/numPartitions should be a multiple of 8 (the number of cores).
The concern here is that I don't see any transformation in the code snippet you shared that could trigger a long-running job. I strongly believe the write to the table in SSMS is what's taking longer in this case.
For the job that's taking time to execute, would you be able to share the DAG from the Spark UI?
Also, check the StackOverflow thread below, where folks suggest using various connectors to improve write performance.
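For reference, one approach often suggested in those threads is the Apache Spark connector for SQL Server (format "com.microsoft.sqlserver.jdbc.spark"), which uses bulk copy rather than row-by-row JDBC inserts. The snippet below is only a minimal sketch, assuming that connector library is installed on the cluster and reusing the url_connector, user and pwd variables from the original code; verify the option names and values against the connector's documentation before relying on them.

# Sketch only: write via the SQL Server bulk-copy connector instead of plain JDBC.
# Assumes the com.microsoft.sqlserver.jdbc.spark library is installed on the cluster.
(
    df_CorpBond.repartition(8)                      # match the 8 worker cores mentioned above
    .write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("url", url_connector)
    .option("dbtable", "MarkIt_CorpBonds")
    .option("user", user)
    .option("password", pwd)
    .option("tableLock", "true")                    # bulk insert with a table lock; only if exclusive access is acceptable
    .option("batchsize", "100000")
    .save()
)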
12-24-2024 02:30 AM
I removed numPartitions, batch size and repartition as well; the job still takes almost 3 hours to write the data into the SSMS tables.
12-26-2024 01:22 AM
Instead of removing them, try tweaking numPartitions, repartition and the shuffle partitions to see if that increases the write speed. The Spark UI and DAG will show the exact execution plan, so we can see what's taking the time when loading the table into SSMS.
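As an illustration of that suggestion, here is a minimal sketch that keeps the DataFrame partitions, the JDBC numPartitions and the shuffle partitions aligned with the 8 worker cores discussed earlier. The specific numbers are assumptions to be tuned against what the Spark UI shows, and the connection variables are the same ones from the original snippet.

# Sketch only: keep write parallelism consistent with the 8 worker cores.
# The values below are starting points to tune, not recommended settings.
spark.conf.set("spark.sql.shuffle.partitions", 8)

print(df_CorpBond.rdd.getNumPartitions())           # check how many write tasks will actually run

(
    df_CorpBond.repartition(8)                      # one write task per core
    .write
    .format("jdbc")
    .option("url", url_connector)
    .option("dbtable", "MarkIt_CorpBonds")
    .option("user", user)
    .option("password", pwd)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("numPartitions", 8)                     # match the DataFrame partition count instead of 100
    .option("batchsize", 100000)
    .mode("overwrite")
    .save()
)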
12-29-2024 10:43 PM