Databricks JDBC/ODBC write batch size

Alexander1
New Contributor III

I have spent way too much time trying to find a solution to the problem of efficiently writing data to Databricks via JDBC/ODBC. I have looked through countless docs, blogs, and repos, and I cannot find a single example where someone sets some kind of batch/bulk/buffer option to tell the JDBC driver not to write/insert data row-wise into a Delta table. Is anyone aware of a solution? Many JDBC drivers for other destinations offer bulk/batch writes; even the Databricks MS SQL connector supports a bulkCopyToSqlDB method, and there is something similar for Snowflake. Why can I not find anything for writing to Databricks? I looked into many things, including the options below. Any pointer is much appreciated.


5 REPLIES

Anonymous
Not applicable

@Alexander Wagner:

Databricks Delta Lake supports bulk writes, which are more efficient than row-wise writes. However, the method for achieving this may vary depending on the specific JDBC driver being used. Here are a few possible solutions:

  1. Use the Apache Spark JDBC connector instead of the Databricks JDBC connector. This connector supports bulk inserts, which can be used with Delta Lake. You can specify the batch size using the batchsize option when creating the JDBC connection (see the sketch at the end of this reply).
  2. Use the spark.databricks.delta.maxFileSize option to control the size of the files being written. By setting a large value for this option, you can ensure that Delta Lake will write larger files and improve write performance.
  3. Use the OPTIMIZE command to merge small files into larger files. This will improve query performance and reduce storage overhead. You can run this command periodically on your Delta tables to optimize them.
  4. If you are using the Databricks JDBC connector, you can try setting the rewriteBatchedStatements option to true. This can improve write performance by allowing the driver to send multiple statements in a single batch.

Keep in mind that the optimal solution may vary depending on your specific use case, so it's important to test and evaluate the performance of each approach.
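
For illustration, here is a minimal sketch of options 1 and 3 as run from a Spark session. The JDBC URL, table names, and source DataFrame below are placeholders, and, as the follow-up reply points out, the batchsize option governs Spark writing out over JDBC rather than an external client writing into Databricks:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("source_table")  # hypothetical source data

# Option 1: Spark's generic JDBC writer groups INSERTs into batches of
# `batchsize` rows (default 1000) when pushing a DataFrame to a JDBC target.
(df.write
   .format("jdbc")
   .option("url", "jdbc:databricks://<server-hostname>:443;httpPath=<http-path>")  # placeholder URL
   .option("dbtable", "target_schema.target_table")                                # placeholder table
   .option("batchsize", 10000)      # rows per INSERT batch
   .option("numPartitions", 8)      # parallel connections writing concurrently
   .mode("append")
   .save())

# Option 3: periodically compact small files in the Delta table.
spark.sql("OPTIMIZE target_schema.target_table")
```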

Alexander1
New Contributor III

Hi @Suteja Kanuri, thanks for the reply. Let me take your options one by one.

  1. Apache Spark JDBC connector: I don't see a bulk/batch write option TO Spark, only from Spark to OTHER databases; if you have a docs pointer, it would be highly appreciated.
  2. spark.databricks.delta.maxFileSize: that option applies after the data has been transferred; it does not increase transfer speed.
  3. OPTIMIZE: see (2).
  4. rewriteBatchedStatements: see (2).

For the time being, we leverage the CData Databricks JDBC driver with its bulk load option (from SAS), which actually uses a two-step transfer: it first writes the data to cloud storage (ADLS) and then uses Databricks COPY INTO for the load and Hive registration. It's pretty fast but has a few drawbacks, such as full write only (no append/insert/update) and no schema definition (i.e. CData drives the type conversion). But it's certainly better than a row-wise Databricks JDBC write.
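
The same two-step pattern can also be reproduced by hand from a Databricks notebook: stage the data as files in cloud storage, then load them with COPY INTO. A minimal sketch follows; the ADLS path and table names are placeholders, and it illustrates the general staging + COPY INTO pattern rather than what the CData driver does internally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("source_table")  # hypothetical source data

# Step 1: stage the data as Parquet files in cloud storage (ADLS).
# The CData driver performs this step on the client side; here Spark
# writes the staging files purely to illustrate the pattern.
staging_path = "abfss://staging@<storage-account>.dfs.core.windows.net/exports/my_table"  # placeholder
df.write.mode("overwrite").parquet(staging_path)

# Step 2: COPY INTO loads the staged files into an existing Delta table
# as one bulk, idempotent operation instead of row-wise INSERTs.
spark.sql(f"""
    COPY INTO target_schema.target_table
    FROM '{staging_path}'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```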

Anonymous
Not applicable

@Alexander Wagner:

You are correct that the Apache Spark JDBC connector does not have a bulk/batch write option to Spark itself, only from Spark to other databases.

It sounds like you have found a solution using the CData Databricks JDBC driver with its bulk load option, which uses a two-step transfer approach. While this approach has some drawbacks, it seems to be working well for your use case. All the best!

Anonymous
Not applicable

Hi @Alexander Wagner​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

Alexander1
New Contributor III

@Vidula Khanna yes, have done so. Thanks.
