11-14-2024 03:02 AM
Hello team,
I am experiencing an issue with insert operations on Databricks using the JDBC driver. In my SAS Viya system, the DatabricksJDBC42.jar driver version 2.6.40 is configured. I’ve noticed that, up to Databricks Runtime version 13.1, insert operations are performed in bulk. However, starting from Runtime version 14.1, these insert operations are executed row by row, leading to a significant slowdown in performance.
I also noticed that with version 13, a log line is displayed (highlighted in yellow in the image on the right), while in version 14 there is no such log entry. Could there be a correlation? Could this difference be relevant?
Has anyone else encountered this issue or have any suggestions on how to resolve it? Any advice would be greatly appreciated!
Thank you!
11-14-2024 07:57 AM
From what I can see internally, the JDBC connection is not intended to support batch runs, and there is actually a feature request open to add this capability in the near future. Does the cluster on 13.1 have any additional setup or Spark config that could be causing this different behavior?
11-14-2024 08:27 AM
Hi @Walter_C
Thank you for the clarification. However, I’d like to emphasize that before version 14.1, specifically in version 13.1, insert operations were indeed performed in bulk. By "bulk operations," I mean that multiple SQL statements were executed within a single insert command, as shown in the examples below:
Version 14.1:
`INFO SparkExecuteStatementOperation: Submitting query 'INSERT INTO TEST_20241113 (ID, Age, Amt) VALUES (?, ?, ?)'`
Version 13.1:
`INSERT INTO vltn_gold.TEST_20241113 (ID, Age, Amt) VALUES (1.0, 25.0, 40000.0), (2.0, 30.0, 45000.0)`
As you can see, in version 13.1, multiple rows were inserted in a single statement, whereas in version 14.1, each row is inserted individually.
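To illustrate the difference, here is a minimal sketch (in Java, since this is a JDBC thread) of how a 13.1-style multi-row `INSERT` statement can be assembled on the client side. The table and column names are taken from the log lines above; the helper class and method names are hypothetical, purely for illustration, and not part of any Databricks driver API.

```java
// Sketch: build a single multi-row INSERT of the form
//   INSERT INTO t (c1, c2) VALUES (?, ?), (?, ?), ...
// instead of issuing one single-row statement per record.
public class MultiRowInsertSketch {

    // Returns a parameterized multi-row INSERT for the given table,
    // columns, and number of rows. Requires Java 11+ (String.repeat).
    public static String buildMultiRowInsert(String table, String[] columns, int rowCount) {
        String cols = String.join(", ", columns);
        // One "(?, ?, ..., ?)" group per row.
        String row = "(" + "?, ".repeat(columns.length - 1) + "?)";
        StringBuilder sql = new StringBuilder("INSERT INTO " + table + " (" + cols + ") VALUES ");
        for (int i = 0; i < rowCount; i++) {
            if (i > 0) sql.append(", ");
            sql.append(row);
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Two rows, matching the shape of the 13.1 log line above.
        System.out.println(buildMultiRowInsert("TEST_20241113",
                new String[]{"ID", "Age", "Amt"}, 2));
        // INSERT INTO TEST_20241113 (ID, Age, Amt) VALUES (?, ?, ?), (?, ?, ?)
    }
}
```

Binding all row values to one such statement and executing it once is what makes the 13.1 behavior so much faster than one round trip per row.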
The Spark settings and cluster setup (aside from the different runtime versions) are identical on both clusters, so there’s no additional configuration or setup difference that could be causing this change in behavior.
Thank you
11-15-2024 08:57 AM
Were you using parameterized queries? I believe native support for them starts in Runtime 14.1.
See if setting `EnableNativeParameterizedQuery=0` in the connection string mitigates the issue.
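For anyone trying this, a minimal sketch of where the property goes: Simba-style Databricks JDBC URLs take `;`-separated `key=value` pairs after the host, so the flag can simply be appended to the existing URL. The property name `EnableNativeParameterizedQuery` comes from this thread; the host, HttpPath, and helper below are placeholders, not a definitive configuration.

```java
// Sketch: append EnableNativeParameterizedQuery=0 to a Databricks JDBC URL.
// The base URL here uses placeholder host/path values.
public class JdbcUrlSketch {

    // Simba-style URLs chain ';'-separated key=value properties after the host.
    public static String withNativeParamsDisabled(String baseUrl) {
        return baseUrl + ";EnableNativeParameterizedQuery=0";
    }

    public static void main(String[] args) {
        String url = withNativeParamsDisabled(
                "jdbc:databricks://example.cloud.databricks.com:443"
                + ";HttpPath=/sql/1.0/warehouses/abc;AuthMech=3;UID=token");
        System.out.println(url);
    }
}
```

The resulting URL would then be passed unchanged to `DriverManager.getConnection(...)` as usual.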
11-18-2024 12:53 AM
Dear @Walter_C,
Thank you for your time and patience.
Using the suggested parameter in the JDBC URL, the insert behavior has returned to normal, and everything works perfectly. However, I have not been able to find any references to this parameter in the official documentation. I have reviewed the Databricks JDBC Driver Installation and Configuration Guide but couldn’t locate any mention of it.
Could you please indicate where I can find information about this parameter?
Thank you once again for your help!
Best regards,
Carmine
11-22-2024 12:14 AM
Dear @Walter_C ,
After an initial test with a moderate number of rows, we believed it to be the optimal solution.
However, upon increasing the row count beyond a million, we realized the dataset is divided into chunks of unclear size. Sometimes chunks are created with 25k rows, while other times with 20k rows.
Would it be possible to provide a guide on the parameters that can be used to manage chunk sizes effectively?
As I said, we can't find anything about this here: Databricks JDBC Driver Installation and Configuration Guide
Thank you!
05-16-2025 07:02 AM
Hello everyone! I'm getting this same issue with DBR 16.4 and Databricks JDBC OSS 1.0.5
https://docs.databricks.com/aws/en/integrations/jdbc-oss
Is there any config param that I can use to be able to get multiple rows inserts instead of this row by row behavior?
Thanks in advance
05-16-2025 07:05 AM
@Walter_C this is the format of the JDBC connection I'm using if it helps:
`jdbc:databricks://host:443;HttpPath=path;ConnCatalog=catalog;ConnSchema=schema;SSL=1;AuthMech=3;UID=token;SocketTimeout=0;EnableArrow=1;UseThriftClient=0`
We are using Serverless SQL Warehouse to perform the writes
05-16-2025 08:01 AM
Hi @JoseSoto ,
did you try the param EnableNativeParameterizedQuery=0 in the connection string?
05-16-2025 08:24 AM - edited 05-16-2025 08:24 AM
Hi @cotrariello !
I tried that, but that parameter is not available in the new Databricks JDBC OSS driver, which is the one we are using (linked above). Unfortunately, we can't use the Simba JDBC driver.
3 weeks ago
I've implemented a fix for this in the Databricks JDBC OSS driver and am looking to get it merged: https://github.com/databricks/databricks-jdbc/pull/943
Sunday
The PR looks closed for now; any news on this? Also, is there a code snippet available showing how to use this functionality?