<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Effects of materialized view with Cluster BY in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150504#M53442</link>
    <description>&lt;P&gt;Thank you, Louis, for the swift response. That helps a lot!&lt;BR /&gt;Indeed the query on databricks is &amp;lt;1 minute, while the refresh time in PowerBI is about 12 minutes. So it is the data transfer/bandwith which is causing the issue.&lt;BR /&gt;All points clear, just one follow-up question.&lt;BR /&gt;I was looking into your suggestion #1 (pruning): What exactly should I look for? I saw that "Files read" is a bit less than the distinct values of `a` in that query and the percentage is about 30%.&lt;/P&gt;</description>
    <pubDate>Tue, 10 Mar 2026 17:17:41 GMT</pubDate>
    <dc:creator>malterializedvw</dc:creator>
    <dc:date>2026-03-10T17:17:41Z</dc:date>
    <item>
      <title>Effects of materialized view with Cluster BY</title>
      <link>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150470#M53426</link>
      <description>&lt;P&gt;Hi folks,&lt;BR /&gt;I have a question on whether I am using materialized views right.&lt;BR /&gt;Our pipeline looks like this:&lt;BR /&gt;1. A spark job creates a table `&lt;EM&gt;source`&lt;/EM&gt; with columns a, b and c.&lt;BR /&gt;2. A materialized view &lt;EM&gt;`target`&amp;nbsp;&lt;/EM&gt; is created on &lt;EM&gt;`source`.&lt;/EM&gt; I want to partition it by `&lt;EM&gt;a&lt;/EM&gt;`, so I add a `&lt;EM&gt;CLUSTER BY&lt;/EM&gt;`.&lt;BR /&gt;3. PowerBI queries `&lt;EM&gt;target`&amp;nbsp;&lt;/EM&gt; by doing `&lt;EM&gt;select * from target where a is between x and y&lt;/EM&gt;`&amp;nbsp; (the reason why I want a partition)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;CREATE MATERIALIZED VIEW IF NOT EXISTS target
            TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
            TRIGGER ON UPDATE
    select
       a,
       b,
       sum(c)
    from source
    group by all
        CLUSTER BY (`a`)&lt;/LI-CODE&gt;&lt;P&gt;Unfortunately the PowerBi querying is still very slow (compared to a simple view without clustering).&lt;BR /&gt;But this could be a bandwith issue, bit hard to find out.&lt;BR /&gt;Two questions:&lt;BR /&gt;1. Is the definition of the materialized view correct?&lt;BR /&gt;2. Would it matter to add partitioning additionally in the source table (let's assume this table is not accessed otherwise)?&lt;BR /&gt;&lt;BR /&gt;For any clarifications or hints, please let me know.&lt;BR /&gt;Thanks and best regards&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Mar 2026 10:02:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150470#M53426</guid>
      <dc:creator>malterializedvw</dc:creator>
      <dc:date>2026-03-10T10:02:48Z</dc:date>
    </item>
    <item>
      <title>Re: Effects of materialized view with Cluster BY</title>
      <link>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150479#M53431</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/219179"&gt;@malterializedvw&lt;/a&gt;&amp;nbsp;, I did some digging and have some helpful hints for your to consider as you work through your scenario.&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;Your MV definition looks syntactically fine, but there are a few things I’d check.&lt;/P&gt;
&lt;P class="p1"&gt;First, &lt;SPAN class="s1"&gt;CLUSTER BY&lt;/SPAN&gt; on a materialized view applies liquid clustering to the materialized output, not the source table. Since you’re doing a &lt;SPAN class="s1"&gt;GROUP BY&lt;/SPAN&gt; with &lt;SPAN class="s1"&gt;sum(c)&lt;/SPAN&gt;, the MV is effectively becoming a pre-aggregated table clustered by &lt;SPAN class="s1"&gt;a&lt;/SPAN&gt;, which is the right general idea for a Power BI range-filter use case.&lt;/P&gt;
&lt;P class="p1"&gt;That said, liquid clustering is only helpful once the data has actually been physically reorganized. If the MV was just created or recently refreshed, the file layout may not yet reflect the clustering strategy. I’d verify whether compaction and clustering work have actually kicked in. You can check the table’s clustering metrics, or run &lt;SPAN class="s1"&gt;OPTIMIZE&lt;/SPAN&gt; on the MV directly to force the issue.&lt;/P&gt;
&lt;P class="p1"&gt;A couple of other things are worth looking at as well:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;delta.columnMapping.mode = 'name'&lt;/SPAN&gt; can introduce a bit of overhead and occasionally causes compatibility quirks with certain BI connectors. If you do not specifically need column rename/drop support, it may be worth removing that setting just to see whether behavior improves.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;TRIGGER ON UPDATE&lt;/SPAN&gt; simply means refreshes happen when explicitly triggered by your pipeline or process. That is fine, but make sure those refreshes are actually happening after source-table changes so the MV stays current.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;On your second question: partitioning the source table is unlikely to help Power BI query performance in this case. The MV materializes its own data, so source-table layout mainly affects refresh efficiency, not read performance against the MV itself.&lt;/P&gt;
&lt;P class="p1"&gt;For the slow Power BI queries specifically, this is where I’d start:&lt;/P&gt;
&lt;OL start="1"&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Check the query profile in Databricks SQL and confirm whether file pruning is happening on the &lt;SPAN class="s1"&gt;a&lt;/SPAN&gt; filter. If pruning is not happening, the clustering likely is not effective yet.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Confirm whether Power BI is using DirectQuery or Import mode. DirectQuery adds translation and round-trip overhead on every interaction, which can become noticeable very quickly.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Keep the bandwidth point in mind too. Even with good clustering, if the filtered result set is still fairly large, data transfer itself may be the bottleneck.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="p1"&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Tue, 10 Mar 2026 11:43:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150479#M53431</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-03-10T11:43:47Z</dc:date>
    </item>
    <item>
      <title>Re: Effects of materialized view with Cluster BY</title>
      <link>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150504#M53442</link>
      <description>&lt;P&gt;Thank you, Louis, for the swift response. That helps a lot!&lt;BR /&gt;Indeed the query on databricks is &amp;lt;1 minute, while the refresh time in PowerBI is about 12 minutes. So it is the data transfer/bandwith which is causing the issue.&lt;BR /&gt;All points clear, just one follow-up question.&lt;BR /&gt;I was looking into your suggestion #1 (pruning): What exactly should I look for? I saw that "Files read" is a bit less than the distinct values of `a` in that query and the percentage is about 30%.&lt;/P&gt;</description>
      <pubDate>Tue, 10 Mar 2026 17:17:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/effects-of-materialized-view-with-cluster-by/m-p/150504#M53442</guid>
      <dc:creator>malterializedvw</dc:creator>
      <dc:date>2026-03-10T17:17:41Z</dc:date>
    </item>
  </channel>
</rss>

