Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Effects of materialized view with Cluster BY

malterializedvw
New Contributor II

Hi folks,
I have a question on whether I am using materialized views right.
Our pipeline looks like this:
1. A spark job creates a table `source` with columns a, b and c.
2. A materialized view `target` is created on `source`. I want to partition it by `a`, so I add a `CLUSTER BY`.
3. Power BI queries `target` with `select * from target where a between x and y` (the reason why I want a partition)

CREATE MATERIALIZED VIEW IF NOT EXISTS target
    CLUSTER BY (a)
    TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
    TRIGGER ON UPDATE
AS
SELECT
    a,
    b,
    SUM(c) AS c
FROM source
GROUP BY ALL

Unfortunately the Power BI querying is still very slow (compared to a simple view without clustering).
But this could be a bandwidth issue, which is a bit hard to verify.
Two questions:
1. Is the definition of the materialized view correct?
2. Would it additionally help to partition the source table (let's assume this table is not accessed otherwise)?

For any clarifications or hints, please let me know.
Thanks and best regards


1 ACCEPTED SOLUTION


Louis_Frolio
Databricks Employee

Greetings @malterializedvw , I did some digging and have some helpful hints for you to consider as you work through your scenario.

Your MV definition looks syntactically fine, but there are a few things I’d check.

First, CLUSTER BY on a materialized view applies liquid clustering to the materialized output, not the source table. Since you’re doing a GROUP BY with sum(c), the MV is effectively becoming a pre-aggregated table clustered by a, which is the right general idea for a Power BI range-filter use case.

That said, liquid clustering is only helpful once the data has actually been physically reorganized. If the MV was just created or recently refreshed, the file layout may not yet reflect the clustering strategy. I’d verify whether compaction and clustering work have actually kicked in. You can check the table’s clustering metrics, or run OPTIMIZE on the MV directly to force the issue.
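As a sketch of that check, assuming the MV is named `target` as in your definition, the two statements might look like this (whether `OPTIMIZE` can be run directly on a given MV can depend on how the MV is managed, so treat this as a starting point):

```sql
-- Force compaction and clustering work on the materialized output
OPTIMIZE target;

-- Inspect the table detail, including clusteringColumns and file counts,
-- to confirm the layout reflects the CLUSTER BY choice
DESCRIBE DETAIL target;
```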

A couple of other things are worth looking at as well:

  • delta.columnMapping.mode = 'name' can introduce a bit of overhead and occasionally causes compatibility quirks with certain BI connectors. If you do not specifically need column rename/drop support, it may be worth removing that setting just to see whether behavior improves.

  • TRIGGER ON UPDATE simply means refreshes happen when explicitly triggered by your pipeline or process. That is fine, but make sure those refreshes are actually happening after source-table changes so the MV stays current.
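For example, if you want to trigger a refresh manually after your Spark job rewrites `source`, the statement is simply:

```sql
-- Manually refresh the materialized view so it picks up new source data
REFRESH MATERIALIZED VIEW target;
```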

On your second question: partitioning the source table is unlikely to help Power BI query performance in this case. The MV materializes its own data, so source-table layout mainly affects refresh efficiency, not read performance against the MV itself.

For the slow Power BI queries specifically, this is where I’d start:

  1. Check the query profile in Databricks SQL and confirm whether file pruning is happening on the a filter. If pruning is not happening, the clustering likely is not effective yet.

  2. Confirm whether Power BI is using DirectQuery or Import mode. DirectQuery adds translation and round-trip overhead on every interaction, which can become noticeable very quickly.

  3. Keep the bandwidth point in mind too. Even with good clustering, if the filtered result set is still fairly large, data transfer itself may be the bottleneck.
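To illustrate step 1, you could run the same range query Power BI issues (the bounds 10 and 20 below are placeholders) and then open its query profile in Databricks SQL to compare files read against files pruned:

```sql
-- Mimic the Power BI range filter; after running, check the query profile
-- to see whether file pruning occurred on column a
SELECT * FROM target WHERE a BETWEEN 10 AND 20;
```

If most files are read despite a narrow range, the clustering layout likely has not taken effect yet.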

Hope this helps, Louis.


2 REPLIES 2


Thank you, Louis, for the swift response. That helps a lot!
Indeed, the query on Databricks takes under a minute, while the refresh in Power BI takes about 12 minutes. So it is the data transfer/bandwidth that is causing the issue.
All points clear, just one follow-up question.
I was looking into your suggestion #1 (pruning): What exactly should I look for? I saw that "Files read" is a bit less than the number of distinct values of `a` in that query, and the percentage is about 30%.