Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks query performance when filtering on a column correlated to the partition-column

Erik
Valued Contributor II

(This is a copy of a question I asked on stackoverflow here, but maybe this community is a better fit for the question):

Setting: Delta Lake, Databricks SQL compute used by Power BI.

I am wondering about the following scenario: We have a column `timestamp` and a derived column `date` (which is the date of `timestamp`), and we choose to `partitionBy` date. When we query, we use `timestamp` in the filter, **not** date.

My understanding is that Databricks a priori won't connect the timestamp and the date, and so seemingly won't get any advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit from partitioning even if it's on a column we don't *explicitly* use in the query.
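The file-skipping intuition above can be sketched in plain Python. This is only an illustration of the mechanism, not Delta Lake's actual metadata format: the per-file min/max statistics and file paths below are made up, but they mimic the statistics Delta keeps per data file.

```python
from datetime import datetime

# Hypothetical per-file statistics (a sketch of the min/max values a Delta
# table would track for each data file; not the real metadata layout).
files = [
    {"path": "date=2023-01-01/part-0.parquet",
     "min_ts": datetime(2023, 1, 1, 0), "max_ts": datetime(2023, 1, 1, 23)},
    {"path": "date=2023-01-02/part-0.parquet",
     "min_ts": datetime(2023, 1, 2, 0), "max_ts": datetime(2023, 1, 2, 23)},
    {"path": "date=2023-01-03/part-0.parquet",
     "min_ts": datetime(2023, 1, 3, 0), "max_ts": datetime(2023, 1, 3, 23)},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the filter range."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# Filter on `timestamp` only -- the date partitioning still pays off, because
# each partition's files cover a narrow timestamp range, so most are skipped.
scan = files_to_scan(files, datetime(2023, 1, 2, 8), datetime(2023, 1, 2, 17))
print(scan)  # → ['date=2023-01-02/part-0.parquet']
```

Because the partition value is derived from the timestamp, no file's timestamp range straddles a date boundary, which is why the min/max check prunes so effectively here.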

1. Is this correct, or have I misunderstood something?

2. What is the performance cost (roughly) of having to filter away files in this way vs. using the partitioning directly?

3. Will Databricks keep all the min/max information in memory, or does it have to go out and look at the files for each query?

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

In the query I would filter first by date (generated from the timestamp we want to query) and then by the exact timestamp, so it will get the partitioning benefit.
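That rewrite can be automated on the client side. A minimal sketch in plain Python: derive date bounds from the timestamp bounds and add them as an extra predicate on the partition column. The helper name `with_partition_filter` is hypothetical; the column names `date` and `timestamp` match the thread's example.

```python
from datetime import datetime

def with_partition_filter(lo_ts: datetime, hi_ts: datetime) -> str:
    """Build a WHERE clause that adds a predicate on `date` (the partition
    column), derived from the `timestamp` bounds, so partition pruning can
    kick in even though the user only specified a timestamp range."""
    return (
        f"WHERE date BETWEEN '{lo_ts.date()}' AND '{hi_ts.date()}' "
        f"AND timestamp BETWEEN '{lo_ts}' AND '{hi_ts}'"
    )

clause = with_partition_filter(datetime(2023, 1, 2, 8), datetime(2023, 1, 4, 17))
print(clause)
# WHERE date BETWEEN '2023-01-02' AND '2023-01-04'
#   AND timestamp BETWEEN '2023-01-02 08:00:00' AND '2023-01-04 17:00:00'
```

The date predicate is deliberately inclusive on both ends, so it never filters out rows that the timestamp predicate would keep.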

Erik
Valued Contributor II

That's a good idea, but I am wondering about the case where one does *not* include the partitioning column in the queries. Imagine, for example, that we give "business people" access to the Databricks SQL endpoint through Power BI, and they are free to create dashboards as they want, completely without any understanding of partitioning columns.

Hubert-Dudek
Esteemed Contributor III

Hehe, I actually have the same problem now. I thought about creating a ready-made dataset from the data lake in Power BI, but the only options I see are SQL or OneDrive files.

Hubert-Dudek
Esteemed Contributor III

I've just checked that you can load all CSVs from the data lake into your Power BI. Publish it and it will create a dataset in Power BI which others can use (you can set scheduled refreshes in the cloud). I still need to check how to implement incremental refresh. I am also not sure about space limitations in Power BI.

Hubert-Dudek
Esteemed Contributor III

1 GB per dataset for Pro, 100 GB for Premium.

Erik
Valued Contributor II

Updated Power BI can import Parquet files, which we find are significantly faster than CSV. For small datasets this is fine; the problem is that with somewhat larger datasets, especially if one wants to join them, Power BI becomes extremely slow. Joins which take 1-2 seconds in Python can take hours in Power BI. But by using Databricks SQL we can offload the joins to Databricks, which is significantly faster.
