@Hjalmar Friden:
There are several ways to improve the performance of inserting data into Azure SQL Server using the JDBC connector:
- Increase the batch size: By default, the Spark JDBC writer sends rows in batches of 1,000. Increasing this value can improve insert throughput. For example, you can set the batch size to 10,000 rows with the batchsize option on the DataFrame writer (see the first sketch after this list).
- Use JDBC parallel writes: Spark writes each partition of the DataFrame over its own JDBC connection, so writing with more partitions can significantly speed up data insertion. Use the numPartitions option to cap the number of concurrent connections, and repartition the DataFrame before the write to set the actual degree of parallelism. A good rule of thumb is one partition per CPU core available in your cluster, provided the database can handle that many concurrent connections (see the first sketch after this list).
- Use a different connector: You can try a connector that is optimized for Azure SQL Server. For example, Microsoft's Apache Spark connector for SQL Server and Azure SQL uses the SQL Server bulk copy API and is typically faster than the generic Spark JDBC writer; you can install it on your cluster and use it instead (see the second sketch after this list).
- Optimize your data frame: Depending on the structure of your data, you may be able to optimize your data frame to improve insert performance (see the third sketch after this list). Some possible optimizations include:
- Reducing the number of columns in your data frame to only the ones that are needed.
- Changing the data types of your columns to ones that are more efficient for JDBC, such as using integers instead of strings.
- Repartitioning or coalescing your data frame so that the number of partitions matches the degree of write parallelism your database can sustain.
- Use bulk insert: If you are inserting a large amount of data into an empty table, consider using the SQL Server BULK INSERT command instead of the JDBC connector. This can be done by writing the data frame to a CSV file in cloud storage, then running the BULK INSERT statement in SQL Server to load the data (on Azure SQL Database, BULK INSERT reads from Azure Blob Storage through an external data source). This approach can be much faster than JDBC for large data sets (see the last sketch after this list).
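For reference, here is a minimal sketch combining the batch size and parallel-write options. The server, database, table, credentials, and the source DataFrame are placeholders; replace them with your own:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source DataFrame - replace with the DataFrame you want to insert.
df = spark.table("source_table")

# Hypothetical server, database, and credentials.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

(
    df.repartition(8)                        # actual write parallelism = number of DataFrame partitions
      .write
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.target_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("batchsize", "10000")          # rows per JDBC batch (default is 1000)
      .option("numPartitions", "8")          # upper bound on concurrent JDBC connections
      .mode("append")
      .save()
)
```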
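Here is a hedged sketch of the alternative connector, assuming Microsoft's Apache Spark connector for SQL Server and Azure SQL (format name `com.microsoft.sqlserver.jdbc.spark`) is installed as a library on the cluster; connection details are again placeholders:

```python
# `df` is the DataFrame to insert, as in the first sketch; connection details are placeholders.
(
    df.write
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
      .option("dbtable", "dbo.target_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("tableLock", "true")           # take a table lock so the bulk copy can be faster
      .option("batchsize", "10000")
      .mode("append")
      .save()
)
```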
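As an illustration of the data frame optimizations, a sketch with hypothetical column names; adjust the selected columns, casts, and partition count to your own data:

```python
from pyspark.sql import functions as F

# `df` is the DataFrame to insert, as in the first sketch. Keep only the columns the
# target table needs, cast string columns that are really numbers, and match the
# partition count to the write parallelism the database can sustain.
optimized_df = (
    df.select("id", "customer_id", "amount", "created_at")
      .withColumn("customer_id", F.col("customer_id").cast("int"))
      .repartition(8)
)
```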
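Finally, a rough sketch of the bulk-insert route. It assumes the CSV is staged in an Azure storage container that has already been registered in the database as an external data source (named `MyAzureBlobStorage` here purely for illustration); the paths, file name, and table name are placeholders:

```python
# `df` is the DataFrame to insert, as in the first sketch.
# 1) Stage the DataFrame as CSV in cloud storage (path and names are placeholders).
staging_path = "abfss://staging@mystorageaccount.dfs.core.windows.net/exports/target_table/"
(
    df.write
      .mode("overwrite")
      .option("header", "false")
      .csv(staging_path)
)

# 2) T-SQL to run against the Azure SQL database (e.g. from your SQL client), assuming
#    an external data source named MyAzureBlobStorage already points at the staging
#    container; fill in the actual part-file name produced by the write above.
bulk_insert_sql = """
BULK INSERT dbo.target_table
FROM 'exports/target_table/<part-file>.csv'
WITH (DATA_SOURCE = 'MyAzureBlobStorage', FORMAT = 'CSV', FIELDTERMINATOR = ',');
"""
```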
Note that some of these optimizations may require changes to your code or the structure of your data, so it's important to test each approach carefully to ensure that it works for your specific use case.