
Speed Up JDBC Write from Databricks Notebook to MS SQL Server

berserkersap
Contributor

Hello Everyone,

I have a use case where I need to write a Delta table from Databricks to a SQL Server table using PySpark/Python/Spark SQL.

The Delta table I am writing contains around 3 million records, and the SQL Server table is neither partitioned nor indexed. It takes around 8-10 minutes to write to the SQL Server table no matter what compute I use (with the default JDBC driver).

The code being used:

 

df.write.mode("overwrite").option("truncate", True).jdbc(Url, "dbtable", properties=properties)

 

I have executed this code on Databricks Runtime 11.3 LTS using the E4ds v4 and E16ds v4 compute types, but in both cases it took 8-10 minutes.

Can anyone please suggest a way to reduce this time?

 

P.S. I have tried increasing the batch size, but even that did not help; above 20,000 the process became even slower.
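For reference, the batch size and write parallelism are plain JDBC connection properties passed alongside the credentials. Below is a minimal pure-Python sketch of building that properties dict (the helper name and the default values are illustrative, not my exact code):

```python
# Illustrative helper (not from the original post): builds the JDBC
# properties dict that df.write.jdbc() accepts for a SQL Server target.
def build_jdbc_properties(user, password, batchsize=10000, num_partitions=8):
    return {
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        # Rows sent per round trip; larger is not always faster,
        # as noted above (>20,000 got slower in my tests).
        "batchsize": str(batchsize),
        # Upper bound on concurrent JDBC connections used for the write.
        "numPartitions": str(num_partitions),
    }

properties = build_jdbc_properties("app_user", "<password>", batchsize=5000)
# Then, as in the snippet above:
# df.write.mode("overwrite").option("truncate", True).jdbc(Url, "dbtable", properties=properties)
```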

Also, I cannot install anything on the cluster per the client's requirements. But if installing something is absolutely necessary, please let me know what to install and how to install it.

I have also tried using the driver "com.microsoft.sqlserver.jdbc.spark"; however, it gave me an error:

 

java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark.

 

 Thank You

1 REPLY 1

Kaniz
Community Manager

Hi @berserkersap,

  1. Partition the DataFrame before writing to take advantage of parallelism and speed up the process.
    - Example: df.repartition(10).write.mode("overwrite").option("truncate", True).jdbc(Url, "dbtable", properties=properties)
  2. Tune the JDBC batch size to optimize write performance.
    - Example: set the batch size to 1000 via properties = {"batchsize": "1000"}
    - Experiment with different batch sizes to find the optimal value.
  3. Consider using a more efficient JDBC driver if possible.
    - The "com.microsoft.sqlserver.jdbc.spark" driver may improve performance.
    - Ensure the driver is installed on the Spark cluster.
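On point 3: the ClassNotFoundException in the question usually means the connector jar is simply not on the cluster; it is published as the Maven coordinate com.microsoft.azure:spark-mssql-connector (matched to your Scala/Spark version) and, once installed as a cluster library, is used via format(...) rather than the .jdbc() shorthand. A sketch of that write path, assuming the jar is installed (the function name and option values here are illustrative, not verified against a live cluster):

```python
# Illustrative sketch: write a DataFrame through the dedicated
# SQL Server Spark connector instead of the generic JDBC writer.
def write_with_mssql_connector(df, url, table, user, password):
    (df.write
       .format("com.microsoft.sqlserver.jdbc.spark")  # requires the connector jar
       .mode("overwrite")
       .option("url", url)            # e.g. jdbc:sqlserver://host;databaseName=db
       .option("dbtable", table)
       .option("user", user)
       .option("password", password)
       .option("truncate", "true")    # keep the table definition, just empty it
       .option("tableLock", "true")   # bulk insert under a table lock
       .option("batchsize", "10000")  # tune as discussed above
       .save())
```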

Note: Depending on the specific scenario, these optimizations may not significantly reduce the time.
