
Databricks query performance when filtering on a column correlated to the partition-column

Erik
Valued Contributor II

(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question):

Setting: Delta Lake, with Databricks SQL compute used by Power BI.

I am wondering about the following scenario: we have a column `timestamp` and a derived column `date` (which is the date of `timestamp`), and we choose to `partitionBy` `date`. When we query, we filter on `timestamp`, **not** `date`.

My understanding is that Databricks won't a priori connect `timestamp` and `date`, and so seemingly won't get any advantage from the partitioning. But since the files are in fact (implicitly) partitioned by timestamp, when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems we can get quite a benefit from partitioning even if it's on a column we don't *explicitly* use in the query.
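For concreteness, here is a minimal sketch of the setup in Spark SQL; the table name `events` and the `id` column are hypothetical, and only `timestamp` and `date` come from the scenario above:

```sql
-- Hypothetical Delta table, partitioned by a date column derived from `timestamp`.
CREATE TABLE events (
  id BIGINT,
  `timestamp` TIMESTAMP,
  `date` DATE  -- populated as CAST(`timestamp` AS DATE) at write time
)
USING DELTA
PARTITIONED BY (`date`);

-- The dashboards filter on the timestamp only, never on the partition column:
SELECT count(*)
FROM events
WHERE `timestamp` >= '2021-06-01 00:00:00'
  AND `timestamp` <  '2021-06-02 00:00:00';
```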

1. Is this correct, or have I misunderstood something?

2. What is the performance cost (roughly) of having to filter away files in this way, versus using the partitioning directly?

3. Will Databricks have all the min/max information in memory, or does it have to go out and inspect the files for each query?

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

In the query I would just filter first by the date (generated from the timestamp we want to query) and then by the exact timestamp, so it gets the benefit of the partitioning.
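For example, against the hypothetical `events` table sketched above, the query would carry both predicates:

```sql
-- Filter on the partition column first so partition pruning applies,
-- then narrow the result with the exact timestamp range.
SELECT count(*)
FROM events
WHERE `date` = '2021-06-01'                    -- prunes partitions
  AND `timestamp` >= '2021-06-01 08:00:00'     -- exact filter
  AND `timestamp` <  '2021-06-01 09:00:00';
```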

Erik
Valued Contributor II

That's a good idea, but I am wondering about the case where one does *not* include the partitioning column in the queries. Imagine, for example, that we give "business people" access to the Databricks SQL endpoint through Power BI, and they are free to create dashboards as they like, completely without any understanding of partitioning columns.

Hubert-Dudek
Esteemed Contributor III

Hehe, I actually have the same problem now. I thought about creating a ready-made dataset from the data lake in Power BI, but the only options I see are SQL or OneDrive files.

Hubert-Dudek
Esteemed Contributor III

I've just checked that you can load all the CSVs from the data lake into Power BI. Publish it and it will create a dataset in Power BI which others can use (you can set up scheduled refreshes in the cloud). I still need to check how to implement incremental refresh. I am also not sure about the space limitations in Power BI.

Hubert-Dudek
Esteemed Contributor III

1 GB per dataset for Pro, 100 GB for Premium.

Erik
Valued Contributor II

An updated Power BI can import Parquet files, which we find significantly faster than CSV. For small datasets this is fine; the problem is that with somewhat larger datasets, especially if one wants to join them, Power BI becomes extremely slow. Joins which take 1-2 seconds in Python can take hours in Power BI. But by using Databricks SQL we can offload the joins to Databricks, which is significantly faster.
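As a sketch of the kind of join we push down to the SQL endpoint instead of computing it in Power BI (the table and column names here are hypothetical):

```sql
-- The join runs on the Databricks SQL endpoint; Power BI only imports the result.
SELECT f.`timestamp`, f.amount, d.customer_name
FROM fact_sales f
JOIN dim_customer d
  ON f.customer_id = d.customer_id
WHERE f.`timestamp` >= '2021-01-01';
```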
