Data Engineering

Databricks job taking a long time to load 2.3 GB of data from blob to SSMS table

vijaypodili
New Contributor III
# Note: the "header" option applies to CSV; Parquet files carry their own schema.
df_CorpBond = spark.read.format("parquet").load(f"/mnt/{container_name}/raw_data/dsl.corporate.parquet")
df_CorpBond.repartition(20).write\
    .format("jdbc")\
    .option("url", url_connector)\
    .option("dbtable", "MarkIt_CorpBonds")\
    .option("user", user)\
    .option("password", pwd)\
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")\
    .option("numPartitions", 100)\
    .option("batchsize", 100000)\
    .mode("overwrite")\
    .save()

This is my code to load 2.3 GB of blob data into an SSMS table. The job takes more than 2 hours. My cluster has 94 GB of memory, with 1 driver node and 2 worker nodes. How can we optimize the code?
9 REPLIES

Walter_C
Databricks Employee

Is this slow compared with other runs of the same job, or compared with running it on an all-purpose cluster?

vijaypodili
New Contributor III

I'm trying to say my cluster has enough memory (94 GB), so it can easily handle 2.3 GB of data, but my job is still taking a long time.

[screenshot: Spark UI showing job run durations]

Jobs 2 and 3 completed within 3 minutes, but job 4 is taking much longer to complete its tasks.


aayrm5
Honored Contributor

Hi @vijaypodili, wondering why you repartitioned the df to 20 and then set numPartitions to 100. Also, I see that your cluster has 94 GB of memory, but how many cores does it have?

Riz

vijaypodili
New Contributor III

Hi @aayrm5 

I just want to decrease the job time; that's why I was using repartition, batch size, and numPartitions, but it did not work. Can you please suggest the correct code? These are my worker node details:

Workers: 1-2 (32-64 GB memory, 4-8 cores)
Driver: 1 (32 GB memory, 4 cores)
Runtime: 14.3.x-scala2.12

aayrm5
Honored Contributor

Hi @vijaypodili 

Ideally, given you have 8 cores (for 2 workers), repartition/numPartitions should be a multiple of 8 (the number of cores).
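
For example, a minimal sketch of what that alignment could look like, reusing the variables from your original snippet (the multiple of 16 and the batch size are assumptions to tune, not definitive values):

# 2 workers x 4 cores = 8 cores; use a small multiple of the core
# count so every core stays busy throughout the JDBC write.
target_partitions = 16

df_CorpBond.repartition(target_partitions).write\
    .format("jdbc")\
    .option("url", url_connector)\
    .option("dbtable", "MarkIt_CorpBonds")\
    .option("user", user)\
    .option("password", pwd)\
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")\
    .option("numPartitions", target_partitions)\
    .option("batchsize", 10000)\
    .mode("overwrite")\
    .save()

Keeping repartition() and numPartitions equal means all 16 connections write in a single wave, roughly 2.3 GB / 16 ≈ 150 MB each.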

The concern here is that I don't see any transformation in the shared code snippet that could trigger a long-running job. I strongly believe the JDBC write to the SQL Server table (the one you view in SSMS) is what's taking long in this case.

For the job that's taking time to execute, would you be able to share the DAG from the Spark UI?

Also, check the StackOverflow thread below, where folks suggest using various connectors to improve write performance.
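
For instance, here's a rough sketch using the Apache Spark connector for SQL Server and Azure SQL, assuming the com.microsoft.sqlserver.jdbc.spark library is installed on your cluster (the option values are illustrative, and check that the connector supports your runtime's Spark version):

# The connector does bulk inserts instead of row-by-row JDBC
# batches, which is usually much faster for multi-GB writes.
df_CorpBond.write\
    .format("com.microsoft.sqlserver.jdbc.spark")\
    .option("url", url_connector)\
    .option("dbtable", "MarkIt_CorpBonds")\
    .option("user", user)\
    .option("password", pwd)\
    .option("tableLock", "true")\
    .option("batchsize", 100000)\
    .mode("overwrite")\
    .save()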

Riz

vijaypodili
New Contributor III

I removed numPartitions, batch size, and repartition as well; the job still takes almost 3 hours to write the data into the SSMS tables.

aayrm5
Honored Contributor

Instead of removing them, try tweaking numPartitions, repartition, and the shuffle partitions to see if that increases the write speed. The Spark UI and DAG will tell us the exact execution plan, and we can see what's taking time to load the table into SQL Server.
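
A minimal sketch of the kind of tweaking I mean (both values are assumptions; tune them against what the Spark UI shows):

# Lower the shuffle partition count from the default of 200,
# which is usually far too high for a 2-worker cluster.
spark.conf.set("spark.sql.shuffle.partitions", "16")

# Keep the DataFrame partition count and the writer's numPartitions
# in sync, e.g. repartition(16) together with .option("numPartitions", 16),
# so neither side of the write becomes the bottleneck.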

Riz

vijaypodili
New Contributor III

Hi @aayrm5 
this is my DAG diagram:

[screenshots: Spark UI DAG for the write job]

The file size is 3.5 GB, and in the future we will need to load 14 GB as well.