Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Issue with Non-Bulk Inserts Using JDBC on Databricks Runtime 14.1

cotrariello
New Contributor III

Hello team,

I am experiencing an issue with insert operations on Databricks using the JDBC driver. In my SAS Viya system, the DatabricksJDBC42.jar driver version 2.6.40 is configured. I’ve noticed that, up to Databricks Runtime version 13.1, insert operations are performed in bulk. However, starting from Runtime version 14.1, these insert operations are executed row by row, leading to a significant slowdown in performance.

I also noticed that with version 13, a log line is displayed (highlighted in yellow in the image on the right), while in version 14 there is no such log entry. Could there be a correlation? Could this difference be relevant?

[Screenshots attached: cotrariello_3-1731582128580.png, cotrariello_2-1731582087923.png]

Has anyone else encountered this issue or have any suggestions on how to resolve it? Any advice would be greatly appreciated!

Thank you!

5 REPLIES

Walter_C
Databricks Employee

Per the information I can see internally, the JDBC connection is not intended to support batch runs, and there is a feature request open to have this capability working in the near future. Does the cluster on 13.1 have any additional setup or Spark config that could be causing this different behavior?

cotrariello
New Contributor III

Hi @Walter_C 

Thank you for the clarification. However, I’d like to emphasize that before version 14.1, specifically in version 13.1, insert operations were indeed performed in bulk. By "bulk operations," I mean that multiple SQL statements were executed within a single insert command, as shown in the examples below:

Version 14.1:
`INFO SparkExecuteStatementOperation: Submitting query 'INSERT INTO TEST_20241113 (ID, Age, Amt) VALUES (?, ?, ?)'`

Version 13.1:
`INSERT INTO vltn_gold.TEST_20241113 (ID, Age, Amt) VALUES (1.0, 25.0, 40000.0), (2.0, 30.0, 45000.0)`

As you can see, in version 13.1, multiple rows were inserted in a single statement, whereas in version 14.1, each row is inserted individually.

The Spark settings and cluster setup (aside from the different runtime versions) are identical on both clusters, so there’s no additional configuration or setup difference that could be causing this change in behavior.
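For illustration, the two behaviors can be sketched as follows (a minimal, hypothetical example; in practice the batching happens inside the driver, not in user code):

```java
import java.util.List;
import java.util.StringJoiner;

public class BulkInsertSketch {
    // Build one multi-row INSERT statement, as seen in the 13.1 logs.
    static String multiRowInsert(String table, List<double[]> rows) {
        StringJoiner values = new StringJoiner(", ");
        for (double[] r : rows) {
            values.add(String.format("(%s, %s, %s)", r[0], r[1], r[2]));
        }
        return "INSERT INTO " + table + " (ID, Age, Amt) VALUES " + values;
    }

    public static void main(String[] args) {
        List<double[]> rows = List.of(
                new double[]{1.0, 25.0, 40000.0},
                new double[]{2.0, 30.0, 45000.0});
        // 13.1-style bulk behavior: one statement carries all rows.
        System.out.println(multiRowInsert("vltn_gold.TEST_20241113", rows));
        // 14.1-style behavior: the driver submits one parameterized
        // statement per row, e.g.
        //   INSERT INTO TEST_20241113 (ID, Age, Amt) VALUES (?, ?, ?)
    }
}
```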

Thank you

Walter_C
Databricks Employee

Were you using parameterized queries? I think their native support starts from version 14.1.
See if setting `EnableNativeParameterizedQuery=0` mitigates the issue.
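For example, the property would be appended to the JDBC connection string like any other driver setting (hostname, HTTP path, and token below are placeholders):

```
jdbc:databricks://<server-hostname>:443;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>;EnableNativeParameterizedQuery=0
```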

cotrariello
New Contributor III

Dear @Walter_C,

Thank you for your time and patience.

Using the suggested parameter in the JDBC URL, the insert behavior has returned to normal, and everything works perfectly. However, I have not been able to find any references to this parameter in the official documentation. I have reviewed the Databricks JDBC Driver Installation and Configuration Guide but couldn’t locate any mention of it.

Could you please indicate where I can find information about this parameter?

Thank you once again for your help!

Best regards,
Carmine

cotrariello
New Contributor III

Dear @Walter_C , 

After an initial test with a moderate number of rows, we believed it to be the optimal solution.

However, after increasing the row count beyond a million, we realized the dataset is divided into chunks of inconsistent size: sometimes chunks of 25k rows are created, other times 20k.

Would it be possible to provide a guide on the parameters that can be used to manage chunk sizes effectively?

As I said, we can't find anything about this here: Databricks JDBC Driver Installation and Configuration Guide

Thank you!
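One possible client-side workaround, assuming the driver honors standard JDBC batching, is to control the chunk size yourself by splitting the rows into fixed-size batches before handing them to the driver. This is only a sketch of the splitting logic; the `addBatch()`/`executeBatch()` calls are indicated in comments and should be verified against this driver:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedInsert {
    // Split rows into fixed-size chunks so the batch size is chosen by
    // the client rather than left to the driver.
    static <T> List<List<T>> chunk(List<T> rows, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += size) {
            chunks.add(rows.subList(i, Math.min(i + size, rows.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 55_000; i++) rows.add(i);
        // With a fixed chunk size of 25_000, each batch has a predictable
        // number of rows (here: 25000, 25000, 5000).
        for (List<Integer> batch : chunk(rows, 25_000)) {
            System.out.println(batch.size());
            // For each row in the batch: stmt.setXxx(...); stmt.addBatch();
            // then stmt.executeBatch(); once per chunk.
        }
    }
}
```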
