Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks JDBC/ODBC write batch size

Alexander1
New Contributor III

I have spent way too much time trying to find a solution to the problem of efficiently writing data to Databricks via JDBC/ODBC. I have looked into countless docs, blogs, and repos, and I cannot find one example where someone sets some kind of batch/bulk/buffer option to tell the JDBC driver not to write/insert data row-wise into a Delta table. Is anyone aware of a solution? Many JDBC drivers for other destinations offer bulk/batch writes; even the Databricks MS SQL connector supports a bulkCopyToSqlDB method, and Snowflake offers something similar. Why can I not find anything for writing to Databricks? I have looked into many things, including those below. Any pointer is much appreciated.
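To make the problem concrete, here is a minimal sketch (not from the original post) of the row-wise pattern being described, using pyodbc against the Databricks ODBC driver; the DSN, table name, and columns are hypothetical, and exact parameter support depends on the driver version:

    import pyodbc

    # Hypothetical example: a DSN named "Databricks" pointing at the Databricks
    # (Simba Spark) ODBC driver, a made-up table, and `rows` already in memory.
    rows = [(1, "a", 10.0), (2, "b", 20.0)]

    conn = pyodbc.connect("DSN=Databricks", autocommit=True)
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO my_delta_table (id, name, amount) VALUES (?, ?, ?)",
        rows,
    )
    # The driver typically sends this as one INSERT per row against the Delta
    # table, which is exactly the slow behavior described above.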


5 REPLIES

Anonymous
Not applicable

@Alexander Wagner:

Databricks Delta Lake supports bulk writes, which are more efficient than row-wise writes. However, the method for achieving this may vary depending on the specific JDBC driver being used. Here are a few possible solutions:

  1. Use the Apache Spark JDBC connector instead of the Databricks JDBC connector. This connector supports bulk inserts, which can be used with Delta Lake. You can specify the batch size using the batchsize option when creating the JDBC connection (see the sketch after this list).
  2. Use the spark.databricks.delta.maxFileSize option to control the size of the files being written. By setting a large value for this option, you can ensure that Delta Lake will write larger files and improve write performance.
  3. Use the OPTIMIZE command to merge small files into larger files. This will improve query performance and reduce storage overhead. You can run this command periodically on your Delta tables to optimize them.
  4. If you are using the Databricks JDBC connector, you can try setting the rewriteBatchedStatements option to true. This can improve write performance by allowing the driver to send multiple statements in a single batch.
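As a rough illustration of options (1), (3), and (4), here is a hypothetical PySpark sketch; the JDBC URL, table name, and batch size are placeholders, and rewriteBatchedStatements is simply forwarded to the driver as a connection property, so it only helps with drivers that honor it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a", 10.0)], ["id", "name", "amount"])  # stand-in data

    # Options (1)/(4): write over JDBC with an explicit batch size; unrecognized
    # options are passed through to the driver as connection properties.
    (df.write
        .format("jdbc")
        .option("url", "jdbc:databricks://<workspace-host>:443/default")  # placeholder URL
        .option("dbtable", "target_delta_table")                          # placeholder table
        .option("batchsize", 10000)
        .option("rewriteBatchedStatements", "true")
        .mode("append")
        .save())

    # Option (3): periodically compact the small files such writes produce.
    spark.sql("OPTIMIZE target_delta_table")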

Keep in mind that the optimal solution may vary depending on your specific use case, so it's important to test and evaluate the performance of each approach.

Alexander1
New Contributor III

Hi @Suteja Kanuri, thanks for the reply. Let me take your options one by one.

  1. Apache Spark JDBC connector: I don't see a bulk/batch write option TO Spark, only FROM Spark to OTHER databases; if you have a docs pointer, it would be highly appreciated.
  2. spark.databricks.delta.maxFileSize: That option applies after the data has been transferred; it does not increase transfer speed.
  3. OPTIMIZE: see (2)
  4. rewriteBatchedStatements: see (2)

For the time being, we leverage the CData Databricks JDBC driver with its bulk load option (from SAS), which actually uses a two-step transfer: it first writes the data to cloud storage (ADLS) and then uses Databricks COPY INTO for the load and Hive registration. It's pretty fast but has a few drawbacks, such as full writes only (no append/insert/update) and no schema definition (i.e., CData drives the type conversion). But it's certainly better than a row-wise Databricks JDBC write.
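For reference, a hand-rolled version of that two-step pattern might look roughly like the sketch below; the ADLS path and table name are made up, the target table is assumed to already exist, and the upstream tool is assumed to have staged the files as Parquet (this illustrates the general approach, not the CData driver's actual implementation):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Step 1 (outside Databricks): the upstream tool lands Parquet files in ADLS.
    # Step 2: bulk-load the staged files into an existing Delta table.
    spark.sql("""
        COPY INTO target_delta_table
        FROM 'abfss://staging@myaccount.dfs.core.windows.net/sas_export/'
        FILEFORMAT = PARQUET
        COPY_OPTIONS ('mergeSchema' = 'true')
    """)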

Anonymous
Not applicable

@Alexander Wagner:

You are correct that the Apache Spark JDBC connector does not have a bulk/batch write option to Spark itself, only from Spark to other databases.

It sounds like you have found a solution using the CData Databricks JDBC driver with its bulk load option, which uses a two-step transfer approach. While this approach has some drawbacks, it seems to be working well for your use case. All the best!

Anonymous
Not applicable

Hi @Alexander Wagner,

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

Alexander1
New Contributor III

@Vidula Khanna, yes, I have done so. Thanks.
