
Databricks query performance when filtering on a column correlated to the partition-column

Erik
Valued Contributor II

(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question):

Setting: Delta Lake, with Databricks SQL compute used by Power BI.

I am wondering about the following scenario: we have a column `timestamp` and a derived column `date` (which is the date of `timestamp`), and we choose to `partitionBy` `date`. When we query, we filter on `timestamp`, **not** `date`.

My understanding is that Databricks won't a priori connect `timestamp` and `date`, and so seemingly won't get any advantage from the partitioning. But since the files are in fact (implicitly) partitioned by timestamp, when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems we can get quite a benefit from partitioning even if it's on a column we don't *explicitly* use in the query.
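For concreteness, here is a minimal sketch of the setup in Spark SQL; the table name `events` and the `id` column are hypothetical, and only `timestamp` and `date` come from the scenario above:

```sql
-- Hypothetical Delta table, partitioned by a date column derived from `timestamp`.
CREATE TABLE events (
  id BIGINT,
  `timestamp` TIMESTAMP,
  `date` DATE  -- populated as CAST(`timestamp` AS DATE) at write time
)
USING DELTA
PARTITIONED BY (`date`);

-- The dashboards filter on the timestamp only, never on the partition column:
SELECT count(*)
FROM events
WHERE `timestamp` >= '2021-06-01 00:00:00'
  AND `timestamp` <  '2021-06-02 00:00:00';
```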

1. Is this correct, or have I misunderstood something?

2. What is the performance cost (roughly) of having to filter away files in this way, versus using the partitioning directly?

3. Will Databricks have all the min/max information in memory, or does it have to go out and inspect the files for each query?

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

In the query I would just filter first by the date (generated from the timestamp we want to query) and then by the exact timestamp, so it gets the benefit of the partitioning.
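For example, against the hypothetical `events` table sketched above, the query would carry both predicates:

```sql
-- Filter on the partition column first so partition pruning applies,
-- then narrow the result with the exact timestamp range.
SELECT count(*)
FROM events
WHERE `date` = '2021-06-01'                    -- prunes partitions
  AND `timestamp` >= '2021-06-01 08:00:00'     -- exact filter
  AND `timestamp` <  '2021-06-01 09:00:00';
```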

Erik
Valued Contributor II

That's a good idea, but I am wondering about the case where one does *not* include the partitioning column in the queries. Imagine, for example, that we give "business people" access to the Databricks SQL endpoint through Power BI, and they are free to create dashboards as they like, completely without any understanding of partitioning columns.

Hubert-Dudek
Esteemed Contributor III

Hehe, I actually have the same problem now. I thought about creating a ready-made dataset from the data lake in Power BI, but the only options I see are SQL or OneDrive files.

Hubert-Dudek
Esteemed Contributor III

I've just checked that you can load all the CSVs from the data lake into Power BI. Publish it and it will create a dataset in Power BI which others can use (you can set up scheduled refreshes in the cloud). I still need to check how to implement incremental refresh. I am also not sure about the space limitations in Power BI.

Hubert-Dudek
Esteemed Contributor III

1 GB per dataset for Pro, 100 GB for Premium.

Erik
Valued Contributor II

An updated Power BI can import Parquet files, which we find significantly faster than CSV. For small datasets this is fine; the problem is that with somewhat larger datasets, especially if one wants to join them, Power BI becomes extremely slow. Joins which take 1-2 seconds in Python can take hours in Power BI. But by using Databricks SQL we can offload the joins to Databricks, which is significantly faster.
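As a sketch of the kind of join we push down to the SQL endpoint instead of computing it in Power BI (the table and column names here are hypothetical):

```sql
-- The join runs on the Databricks SQL endpoint; Power BI only imports the result.
SELECT f.`timestamp`, f.amount, d.customer_name
FROM fact_sales f
JOIN dim_customer d
  ON f.customer_id = d.customer_id
WHERE f.`timestamp` >= '2021-01-01';
```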
