<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Advanced Technique in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130405#M48781</link>
    <description>&lt;H1&gt;Reduced Monthly Databricks Bill from $47K to $12.7K&lt;/H1&gt;&lt;P class=""&gt;&lt;STRONG&gt;The Problem:&lt;/STRONG&gt; We were scanning 2.3TB for queries that needed only 8GB of data.&lt;/P&gt;&lt;H2&gt;Three Quick Wins&lt;/H2&gt;&lt;H3&gt;1. Multi-dimensional Partitioning (30% savings)&lt;/H3&gt;&lt;P&gt;# Before: partitioned by a single column&lt;BR /&gt;df.write.partitionBy("date").parquet(path)&lt;/P&gt;&lt;P&gt;# After: partition by multiple columns&lt;BR /&gt;df.repartition("region", "date") \&lt;BR /&gt;.sortWithinPartitions("customer_id") \&lt;BR /&gt;.write.partitionBy("region", "date").parquet(path)&lt;/P&gt;&lt;H3&gt;2. Add Zonemap Index (35% additional savings)&lt;/H3&gt;&lt;P&gt;# Build the index on high-cardinality columns only&lt;BR /&gt;selective_cols = [c for c in df.columns if df.select(c).distinct().count() &amp;gt; 100]&lt;BR /&gt;create_zonemap(table_path, selective_cols)&lt;/P&gt;&lt;H3&gt;3. Query Rewriting (8% more savings)&lt;/H3&gt;&lt;P class=""&gt;Use file pruning so each query reads only the files it needs.&lt;/P&gt;&lt;H2&gt;Daily Cost Impact&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Before:&lt;/STRONG&gt; 847 DBU/day ($1,567)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;After:&lt;/STRONG&gt; 223 DBU/day ($423)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Monthly savings:&lt;/STRONG&gt; $34,300&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;Key Learning:&lt;/STRONG&gt; Z-ordering actually INCREASED our costs by 12%. Targeted zone maps worked better for our access patterns.&lt;/P&gt;&lt;P class=""&gt;What's your biggest Databricks cost optimization win?&lt;/P&gt;</description>
    <pubDate>Mon, 01 Sep 2025 16:04:23 GMT</pubDate>
    <dc:creator>ck7007</dc:creator>
    <dc:date>2025-09-01T16:04:23Z</dc:date>
    <item>
      <title>Advanced Technique</title>
      <link>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130405#M48781</link>
      <description>&lt;H1&gt;Reduced Monthly Databricks Bill from $47K to $12.7K&lt;/H1&gt;&lt;P class=""&gt;&lt;STRONG&gt;The Problem:&lt;/STRONG&gt; We were scanning 2.3TB for queries that needed only 8GB of data.&lt;/P&gt;&lt;H2&gt;Three Quick Wins&lt;/H2&gt;&lt;H3&gt;1. Multi-dimensional Partitioning (30% savings)&lt;/H3&gt;&lt;P&gt;# Before: partitioned by a single column&lt;BR /&gt;df.write.partitionBy("date").parquet(path)&lt;/P&gt;&lt;P&gt;# After: partition by multiple columns&lt;BR /&gt;df.repartition("region", "date") \&lt;BR /&gt;.sortWithinPartitions("customer_id") \&lt;BR /&gt;.write.partitionBy("region", "date").parquet(path)&lt;/P&gt;&lt;H3&gt;2. Add Zonemap Index (35% additional savings)&lt;/H3&gt;&lt;P&gt;# Build the index on high-cardinality columns only&lt;BR /&gt;selective_cols = [c for c in df.columns if df.select(c).distinct().count() &amp;gt; 100]&lt;BR /&gt;create_zonemap(table_path, selective_cols)&lt;/P&gt;&lt;H3&gt;3. Query Rewriting (8% more savings)&lt;/H3&gt;&lt;P class=""&gt;Use file pruning so each query reads only the files it needs.&lt;/P&gt;&lt;H2&gt;Daily Cost Impact&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Before:&lt;/STRONG&gt; 847 DBU/day ($1,567)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;After:&lt;/STRONG&gt; 223 DBU/day ($423)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Monthly savings:&lt;/STRONG&gt; $34,300&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;Key Learning:&lt;/STRONG&gt; Z-ordering actually INCREASED our costs by 12%. Targeted zone maps worked better for our access patterns.&lt;/P&gt;&lt;P class=""&gt;What's your biggest Databricks cost optimization win?&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 16:04:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130405#M48781</guid>
      <dc:creator>ck7007</dc:creator>
      <dc:date>2025-09-01T16:04:23Z</dc:date>
    </item>
    <item>
      <title>Re: Advanced Technique</title>
      <link>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130406#M48782</link>
      <description>&lt;P&gt;hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/180185"&gt;@ck7007&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Isn't this a repeat of your previous post?&amp;nbsp;&lt;A href="https://community.databricks.com/t5/data-engineering/cost/td-p/130078" target="_blank"&gt;https://community.databricks.com/t5/data-engineering/cost/td-p/130078&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;What's the rationale around the repost &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;?&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Mon, 01 Sep 2025 16:25:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130406#M48782</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-01T16:25:14Z</dc:date>
    </item>
    <item>
      <title>Re: Advanced Technique</title>
      <link>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130442#M48797</link>
      <description>&lt;P class=""&gt;@BSGood catch—totally my mistake! 🤦Had multiple drafts open and posted the wrong one. Thanks for the heads-up!&lt;/P&gt;&lt;P class=""&gt;Just deleted the duplicate. What I meant to share was the Bloom filter follow-up that builds on that cost optimization:&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Quick update:&lt;/STRONG&gt; Adding Bloom filters to the zonemap strategy cut another $2.5K/month:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Zonemap alone: 73% file pruning&lt;/LI&gt;&lt;LI&gt;Zonemap + Bloom: 91% file pruning&lt;/LI&gt;&lt;LI&gt;Extra overhead: Only 35MB of memory&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;The combo especially helps with JOIN performance—seeing an 89% reduction in shuffled data.&lt;/P&gt;&lt;P class=""&gt;Appreciate you keeping the community clean! Will be more careful with my posting workflow.&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;Anyone else accidentally post duplicates while juggling multiple optimization experiments?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Sep 2025 06:19:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130442#M48797</guid>
      <dc:creator>ck7007</dc:creator>
      <dc:date>2025-09-02T06:19:04Z</dc:date>
    </item>
    <item>
      <title>Re: Advanced Technique</title>
      <link>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130467#M48803</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/180185"&gt;@ck7007&lt;/a&gt;&amp;nbsp;no worries.&lt;BR /&gt;&lt;BR /&gt;I asked a question on the other thread (&lt;A href="https://community.databricks.com/t5/data-engineering/cost/td-p/130078" target="_blank"&gt;https://community.databricks.com/t5/data-engineering/cost/td-p/130078&lt;/A&gt;), but I'm not sure whether you're classing this thread or the other one as the duplicate, so I'll repost it here.&lt;BR /&gt;&lt;BR /&gt;I didn't see you mention Liquid Clustering (&lt;A href="https://docs.databricks.com/aws/en/delta/clustering" target="_blank"&gt;https://docs.databricks.com/aws/en/delta/clustering&lt;/A&gt;); was there a particular reason why? It was meant to replace Z-ORDER. If you did try it, I'd love to hear what impact it had for your use case.&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Tue, 02 Sep 2025 08:28:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/advanced-technique/m-p/130467#M48803</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-02T08:28:14Z</dc:date>
    </item>
  </channel>
</rss>

