10-15-2021 03:19 AM
(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question):
Setting: Delta Lake, with a Databricks SQL endpoint used by Power BI.
I am wondering about the following scenario: we have a column `timestamp` and a derived column `date` (which is the date of `timestamp`), and we choose to `partitionBy` date. When we query, we filter on `timestamp`, **not** on `date`.
My understanding is that Databricks won't a priori connect `timestamp` and `date`, so it seemingly won't get any advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit from partitioning even on a column we don't *explicitly* use in the query (a rough sketch of the setup follows after the questions).
1. Is this correct, or have I misunderstood something?
2. What is the performance cost (roughly) of having to filter away files in this way vs. using the partitioning directly?
3. Will Databricks keep all the min/max information in memory, or does it have to go out and look at the files for each query?
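For concreteness, a minimal sketch of the kind of setup I mean (the paths and column names are made up, and this assumes a Databricks notebook where `spark` is already available):

```python
from pyspark.sql import functions as F

# Hypothetical source data with a `timestamp` column.
events = spark.read.parquet("/mnt/raw/events")

# Derive `date` from `timestamp` and partition the Delta table by it.
(events
    .withColumn("date", F.to_date("timestamp"))
    .write
    .format("delta")
    .partitionBy("date")
    .mode("overwrite")
    .save("/mnt/delta/events"))

# Later queries only filter on `timestamp`, never on the partition column `date`.
recent = (spark.read.format("delta").load("/mnt/delta/events")
          .filter(F.col("timestamp") >= "2021-10-01 00:00:00"))
```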
10-15-2021 06:54 AM
In the query I would just filter first by date (generated from the timestamp we want to query) and then by the exact timestamp, so it will get the benefit of the partitioning.
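Something along these lines (illustrative names, using the same made-up table as above):

```python
from pyspark.sql import functions as F

# Add a redundant predicate on the partition column `date`
# (derived from the timestamp range) so partition pruning kicks in,
# then filter by the exact timestamp.
ts_from = "2021-10-01 12:00:00"
ts_to = "2021-10-02 12:00:00"

df = (spark.read.format("delta").load("/mnt/delta/events")
      .filter(F.col("date").between(F.to_date(F.lit(ts_from)), F.to_date(F.lit(ts_to))))
      .filter(F.col("timestamp").between(F.lit(ts_from), F.lit(ts_to))))
```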
10-17-2021 04:03 AM
That's a good idea, but I am wondering about the case where one does *not* include the partitioning column in the queries. Imagine, for example, that we are giving "business people" access to the Databricks SQL endpoint through Power BI, and they are free to create dashboards as they want, completely without any understanding of partitioning columns.
10-17-2021 04:09 AM
Hehe, I actually have the same problem now. I thought about creating a ready-made dataset from the data lake in Power BI, but the only options I see are SQL or OneDrive files.
10-17-2021 04:58 AM
I've just checked that you can load all the CSVs from the data lake into Power BI. Publish it and it will create a dataset in Power BI which others can use (you can set refreshes in the cloud). I still need to check how to implement incremental refresh. I am also not sure about the space limitations in Power BI.
10-17-2021 05:07 AM
1GB per dataset for Pro, 100GB for Premium.
10-17-2021 01:20 PM
The updated Power BI can import Parquet files, which we find are significantly faster than CSV. For small datasets this is fine; the problem is that with somewhat larger datasets, especially if one wants to join them, Power BI becomes extremely slow. Joins that take 1-2 seconds in Python can take hours in Power BI. But by using Databricks SQL we can offload the joins to Databricks, which is significantly faster.
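For example, something like this pre-joins the data in Databricks and persists the result, so Power BI only reads the joined table through the SQL endpoint (all names here are illustrative, not our actual schema):

```python
# Sketch: do the join in Databricks and save the result as a table,
# so Power BI reads the pre-joined data instead of joining itself.
orders = spark.read.format("delta").load("/mnt/delta/orders")
customers = spark.read.format("delta").load("/mnt/delta/customers")

joined = orders.join(customers, on="customer_id", how="inner")

(joined.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("analytics.orders_with_customers"))
```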