Hi everyone,
I'm seeing a large performance gap between a query sent directly over JDBC and an equivalent query run through Databricks SQL, where the same data is accessed as a federated SQL Server table through Unity Catalog.
JDBC Query:
jdbc_query = f"""
SELECT TOP 1 *
FROM db.schema.table
WHERE id = (
    SELECT TOP 1 id
    FROM db.schema.table2
)
AND model_id = {model_id}"""
Execution Time: ~2 seconds
Databricks SQL Query (UC):
Since Databricks SQL does not support TOP, I used LIMIT:
uc_query = f"""
SELECT *
FROM db.schema.table
WHERE id = (
    SELECT id
    FROM db.schema.table2
    LIMIT 1
)
AND model_id = {model_id}
LIMIT 1
"""
Execution Time: 6-7 minutes
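To see what Databricks actually sends to SQL Server for the query above, the query plan can be inspected. This is a minimal sketch, assuming a Databricks notebook where `spark` is predefined. `explain_statement` is a hypothetical helper and `model_id = 42` is a placeholder value. For federated tables, the `EXPLAIN FORMATTED` output is expected to show which part of the query was pushed down to the remote database. If the `id` filter or the `LIMIT` is missing from that part, the remote table is probably being read in bulk and filtered on the Databricks side.

```python
def explain_statement(query: str) -> str:
    """Prefix a SQL query with EXPLAIN FORMATTED (hypothetical helper)."""
    # Strip surrounding whitespace and a trailing semicolon so the
    # prefix sits directly in front of the SELECT keyword.
    return "EXPLAIN FORMATTED " + query.strip().rstrip(";")


model_id = 42  # placeholder value, not taken from the original post

uc_query = f"""
SELECT *
FROM db.schema.table
WHERE id = (
    SELECT id
    FROM db.schema.table2
    LIMIT 1
)
AND model_id = {model_id}
LIMIT 1
"""

statement = explain_statement(uc_query)
print(statement)

# In a Databricks notebook (assumption: `spark` is available):
# spark.sql(statement).show(truncate=False)
```

Comparing the pushed-down part of the plan against the JDBC query should show whether the subquery and both filters reach SQL Server at all.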
Additional Observations:
When I load and display each individual table (without applying any filters or subqueries), the time difference between JDBC and Databricks SQL is only 1-2 seconds.
The Question:
Given the significant time difference when running the combined query via Databricks SQL compared to JDBC, I'm trying to understand where these 6-7 minutes are lost.
Is this related to how Databricks SQL translates the query into SQL Server's dialect?
Or do Databricks SQL and JDBC handle the subquery, or query optimization in general, differently?
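If the scalar subquery turns out to be what blocks pushdown (an assumption, not something confirmed here), one possible workaround is to resolve the subquery first and pass its result to the outer query as a literal, so the outer query contains only simple predicates. The sketch below is hypothetical: `fetch_first_row` and `run_sql` are illustrative names, and it assumes `id` and `model_id` are integer columns.

```python
from typing import Callable, Optional, Sequence


def fetch_first_row(
    run_sql: Callable[[str], list],
    model_id: int,
) -> Optional[Sequence]:
    """Run the lookup in two steps instead of one nested query.

    `run_sql` is any callable that executes a SQL string and returns
    a list of rows indexable by position (hypothetical interface).
    """
    # Step 1: resolve the former subquery on its own.
    id_rows = run_sql("SELECT id FROM db.schema.table2 LIMIT 1")
    if not id_rows:
        return None
    lookup_id = id_rows[0][0]

    # Step 2: the outer query now has only literal predicates.
    # int() assumes numeric columns and keeps arbitrary text out of the SQL.
    query = (
        "SELECT * FROM db.schema.table "
        f"WHERE id = {int(lookup_id)} AND model_id = {int(model_id)} "
        "LIMIT 1"
    )
    rows = run_sql(query)
    return rows[0] if rows else None


# In a Databricks notebook (assumption: `spark` is available):
# row = fetch_first_row(lambda q: spark.sql(q).collect(), model_id=42)
```

This trades one round trip for two, which only pays off if each simple query is pushed down to SQL Server in full.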
Any insights, similar experiences, or suggestions on how to improve the performance of the Databricks SQL query would be greatly appreciated!
Thanks in advance!