Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Delta Sharing Inquiry: Requests from Databricks SQL Editor

katiej02
New Contributor II

Hi, I am trying to test our Delta Sharing server using UC in our workspace.
We noticed that when we execute a query, for example: SELECT COUNT(1) FROM table_name WHERE col1 = 'value', it sends two /query requests to our server. The first request has empty predicateHints and limitHints, while the second request contains the predicateHints from the WHERE clause. What's the purpose of the first request? We were also wondering whether we could ignore the first request. Thanks

2 REPLIES

Vidhi_Khaitan
Databricks Employee

This happens because Databricks' Delta Sharing client plans each query in two phases:
First request: sent without predicateHints or limitHints. Its goal is to fetch the basic metadata, such as the schema, partitioning info, and statistics.
Second request (actual query planning): sent with predicateHints and limitHints. This is the actual data access request, incorporating pruning logic and filters.

Thus, ideally we should not ignore the first request.
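
To make the two phases concrete, here is a minimal sketch of what the two /query request bodies might look like from the server's side. The field names `predicateHints` and `limitHint` come from the Delta Sharing protocol; the helper `build_query_body` is hypothetical, and both the exact hint string and whether the first request omits the fields or sends them empty are assumptions, so check what your server actually receives.

```python
import json
from typing import List, Optional


def build_query_body(predicate_hints: Optional[List[str]] = None,
                     limit_hint: Optional[int] = None) -> str:
    """Hypothetical helper: serialize a /query request body as JSON."""
    body = {}
    if predicate_hints:
        # SQL-like filter strings the server may use for best-effort pruning.
        body["predicateHints"] = predicate_hints
    if limit_hint is not None:
        body["limitHint"] = limit_hint
    return json.dumps(body)


# Phase 1: no hints (assumed here to be an empty body).
first_request = build_query_body()

# Phase 2: hints derived from the WHERE clause (string form is an assumption).
second_request = build_query_body(predicate_hints=["col1 = 'value'"])

print(first_request)
print(second_request)
```

A server can tell the two phases apart by checking whether any hints are present, but per the reply above it should still answer the first request rather than drop it.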

katiej02
New Contributor II

Hi, we understand that the client performs a two-phase query planning process. In our case, the table has around 145,000 Parquet files, and we've observed that the first request becomes a significant bottleneck: the response body is large (655 MB) and takes ~10 seconds. We also tried supporting gzip-compressed responses, but it still takes 10 seconds before the second (actual query) request is sent to the server. The majority of those files are ultimately filtered out in the second request.

Is this response cached or reused in the client (e.g., Databricks Runtime), or is this metadata fetch expected to occur for every query? 

For simple queries like:
SELECT COUNT(1) FROM table WHERE part_col1 = 'val1'

Even though such queries will ultimately be pruned down to a small number of files, the protocol requires returning all 145,000 file entries in the initial response just for the client to begin planning.
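
As a rough illustration of the overhead described above, the sketch below works out the average size of one file entry from the quoted figures, then shows how a partition filter shrinks a file list. The file entries are made-up sample data (1,000 files over 100 partition values), the `prune` helper is hypothetical, and 655 MB is assumed to mean decimal megabytes.

```python
# Figures quoted in the post above.
total_files = 145_000
response_bytes = 655 * 10**6  # 655 MB, assuming decimal megabytes

bytes_per_entry = response_bytes / total_files
print(f"~{bytes_per_entry:.0f} bytes per file entry")


def prune(files, column, value):
    """Hypothetical helper: keep only files whose partition value matches."""
    return [f for f in files if f["partitionValues"].get(column) == value]


# Made-up sample data: 1,000 files spread evenly over 100 partition values.
files = [
    {"id": str(i), "partitionValues": {"part_col1": f"val{i % 100}"}}
    for i in range(1000)
]

kept = prune(files, "part_col1", "val1")
print(f"unfiltered: {len(files)} entries, filtered: {len(kept)} entries")
```

With the quoted figures this comes to roughly 4.5 KB per file entry, which is why an unfiltered listing of the whole table gets so large.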

Is this the expected performance behavior in such scenarios, or are there best practices to mitigate the overhead?

Appreciate your insights and design rationale here.

 
