JDBC connector seems to be a bottleneck when trying to insert dataframe to Azure SQL Server

hfrid
New Contributor

Hi!

I am inserting a PySpark dataframe into Azure SQL Server and it takes a very long time. The database is an S4, but my dataframe of 17 million rows and 30 columns takes up to 50 minutes to insert.

Is there a way to significantly speed this up? I am open to many options: switching connectors, running the job in Scala instead, and so on. I assume my dataframe is not even that big, so there must be some way to speed this up.

Also, when the dataframe is this large, the target table is usually empty.

Thanks in advance!

1 REPLY

Anonymous
Not applicable

@Hjalmar Friden:

There are several ways to improve the performance of inserting data into Azure SQL Server through the JDBC connector:

  1. Increase the batch size: By default, the Spark JDBC writer sends data in batches of 1,000 rows. Increasing this value reduces the number of round trips to the server. For example, try setting the batchsize option on the DataFrame writer to 10,000 rows (items 1, 2, and 4 are combined in the first sketch after this list).
  2. Use JDBC parallel writes: Spark writes over JDBC in parallel, opening one connection per dataframe partition, which can significantly speed up the insert. Set the numPartitions option on the writer to control the maximum number of concurrent connections. A good rule of thumb is a partition count equal to the number of CPU cores available in your cluster.
  3. Use a different connector: You can try a connector that is optimized for Azure SQL Server. Microsoft provides the Apache Spark connector for SQL Server and Azure SQL (spark-mssql-connector), which uses the SQL Server bulk-copy APIs and is typically much faster than the generic Spark JDBC writer (see the second sketch after this list).
  4. Optimize your dataframe: Depending on the structure of your data, you may be able to optimize the dataframe itself to improve insert performance. Some possible optimizations include:
    • Reducing the columns in your dataframe to only the ones that are needed.
    • Changing column data types to ones that are more efficient over JDBC, such as integers instead of strings.
    • Coalescing or repartitioning the dataframe so its partition count matches the parallelism your cluster and database can sustain.

  5. Use bulk insert: If you are inserting a large amount of data into an empty table, consider using the SQL Server BULK INSERT command instead of the JDBC connector. Write the dataframe out as CSV files, then use BULK INSERT to load them into SQL Server. For large data sets this can be much faster than JDBC (see the third sketch below).
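
A minimal PySpark sketch combining items 1, 2, and 4, assuming a 32-core cluster; the server, database, credentials, table, and column names are all placeholders:

```python
# Placeholder connection details; in practice, pull credentials from a
# Databricks secret scope rather than hard-coding them.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

(df
    .select("col_a", "col_b")        # item 4: keep only the columns you need
    .repartition(32)                 # item 4: one partition per available core
    .write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("batchsize", 10000)      # item 1: larger batches, fewer round trips
    .option("numPartitions", 32)     # item 2: up to 32 parallel connections
    .mode("append")
    .save())
```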
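
For item 3, a sketch using Microsoft's Spark connector, assuming the spark-mssql-connector library is installed on the cluster and reusing the placeholder details above:

```python
# The connector registers this format name once the library is installed.
(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("tableLock", "true")     # bulk copy with a table lock; fine for an empty target
    .option("batchsize", 10000)
    .save())
```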
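
And for item 5, a sketch of the CSV-staging route. Note that on Azure SQL Database, BULK INSERT reads from Blob Storage through an EXTERNAL DATA SOURCE that must be created in the database beforehand; every name and path below is hypothetical:

```python
# Stage the dataframe as CSV files in Azure storage.
(df.coalesce(8)
    .write
    .mode("overwrite")
    .option("header", "false")
    .csv("abfss://staging@<account>.dfs.core.windows.net/exports/target_table"))

# T-SQL to run against the database afterwards (e.g. via pyodbc), once per
# staged file. 'MyAzureBlobStorage' is a pre-created EXTERNAL DATA SOURCE.
bulk_insert_sql = """
BULK INSERT dbo.target_table
FROM 'exports/target_table/<part-file>.csv'
WITH (DATA_SOURCE = 'MyAzureBlobStorage',
      FORMAT = 'CSV',
      FIELDTERMINATOR = ',',
      TABLOCK);
"""
```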

Note that some of these optimizations may require changes to your code or the structure of your data, so it's important to test each approach carefully to ensure that it works for your specific use case.
