Advanced Technique
09-01-2025 09:04 AM
Reduced Monthly Databricks Bill from $47K to $12.7K
The Problem: We were scanning 2.3TB for queries needing only 8GB of data.
Three Quick Wins
1. Multi-dimensional Partitioning (30% savings)
# Before
df.write.partitionBy("date").parquet(path)
# After: partition by multiple columns
df.repartition("region", "date") \
    .sortWithinPartitions("customer_id") \
    .write.partitionBy("region", "date").parquet(path)
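To illustrate why the extra partition column can reduce scanning, here is a minimal pure-Python sketch of directory-level pruning. It only mimics the idea on a made-up file listing; the paths and the helper names `parse_partition_path` and `prune_by_partition` are hypothetical and are not Spark APIs.

```python
def parse_partition_path(path):
    """Extract key=value partition segments from a Hive-style file path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune_by_partition(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    kept = []
    for path in paths:
        parts = parse_partition_path(path)
        if all(parts.get(k) == v for k, v in wanted.items()):
            kept.append(path)
    return kept


# Hypothetical listing for a table partitioned by region and date.
files = [
    "region=EU/date=2025-01-01/part-0.parquet",
    "region=EU/date=2025-01-02/part-0.parquet",
    "region=US/date=2025-01-01/part-0.parquet",
    "region=US/date=2025-01-02/part-0.parquet",
]

# Filtering on date alone still touches every region.
date_only = prune_by_partition(files, date="2025-01-01")
# Filtering on both partition columns narrows the scan further.
region_and_date = prune_by_partition(files, region="EU", date="2025-01-01")
print(len(date_only), len(region_and_date))
```

The benefit depends on queries actually filtering on both columns; a query that filters on neither reads every file either way.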
2. Add Zonemap Index (35% additional savings)
# Build index on high-cardinality columns only
selective_cols = [c for c in df.columns if df.select(c).distinct().count() > 100]
create_zonemap(table_path, selective_cols)
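The post does not show how `create_zonemap` is implemented, so the following is only a sketch of the general zone map idea, not that helper: record the minimum and maximum of a column per file, then skip any file whose range cannot overlap the query range. The names `FileStats`, `build_zonemap`, and `prune`, along with the sample values, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int
    max_value: int


def build_zonemap(files):
    """files maps path -> list of column values; returns per-file min/max."""
    return [FileStats(p, min(v), max(v)) for p, v in files.items() if v]


def prune(zonemap, lo, hi):
    """Return paths whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.path for s in zonemap if s.max_value >= lo and s.min_value <= hi]


# Hypothetical customer_id values per file. Sorted data gives tight,
# non-overlapping ranges, which is what makes the pruning effective.
files = {
    "part-0.parquet": [1, 5, 9],
    "part-1.parquet": [10, 14, 19],
    "part-2.parquet": [20, 25, 29],
}
zonemap = build_zonemap(files)
candidates = prune(zonemap, 12, 15)
print(candidates)
```

If the same values were scattered randomly across files, every file's range would cover most of the domain and almost nothing would be pruned, which is presumably why the sort step in the partitioning snippet matters.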
3. Query Rewriting (8% more savings)
Use file pruning to read only necessary files.
Daily Cost Impact
- Before: 847 DBU/day ($1,567)
- After: 223 DBU/day ($423)
- Monthly savings: $34,300
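As a quick sanity check, the daily figures above line up with the monthly numbers in the title if a 30-day month is assumed (the post does not state the month length used):

```python
DAYS_PER_MONTH = 30  # assumption: not stated in the post

before_per_day = 1567  # dollars per day, from the post
after_per_day = 423    # dollars per day, from the post

monthly_before = before_per_day * DAYS_PER_MONTH
monthly_after = after_per_day * DAYS_PER_MONTH
monthly_savings = monthly_before - monthly_after
reduction = monthly_savings / monthly_before

print(monthly_before, monthly_after, monthly_savings, round(reduction, 2))
```

This gives roughly $47.0K before, $12.7K after, and about $34.3K saved per month, a reduction of about 73%, which matches the sum of the three savings steps (30% + 35% + 8%).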
Key Learning: Z-ordering actually INCREASED our costs by 12%. Targeted zone maps worked better for our access patterns.
What's your biggest Databricks cost optimization win?
09-01-2025 09:25 AM
Hi @ck7007,
Isn't this a repeat of your previous post? https://community.databricks.com/t5/data-engineering/cost/td-p/130078
What's the rationale around the repost 🙂?
All the best,
BS
09-01-2025 11:19 PM
@BS Good catch, totally my mistake! 🤦 Had multiple drafts open and posted the wrong one. Thanks for the heads-up!
Just deleted the duplicate. What I meant to share was the Bloom filter follow-up that builds on that cost optimization:
Quick update: Adding Bloom filters to the zonemap strategy cut another $2.5K/month:
- Zonemap alone: 73% file pruning
- Zonemap + Bloom: 91% file pruning
- Extra overhead: Only 35MB of memory
The combo especially helps with JOIN performance: we're seeing an 89% reduction in shuffled data.
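For readers unfamiliar with the combination, here is a rough pure-Python sketch of how a zone map and a Bloom filter can work together for point lookups. It is not the setup used above, whose details are not given; the class `BloomFilter`, the functions `build_index` and `files_to_read`, and the sample data are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may return false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_index(files):
    """Per file, store (min, max, bloom) for one column."""
    index = {}
    for path, values in files.items():
        bloom = BloomFilter()
        for v in values:
            bloom.add(v)
        index[path] = (min(values), max(values), bloom)
    return index


def files_to_read(index, key, use_bloom=True):
    """Zone map check first, then the Bloom filter if enabled."""
    kept = []
    for path, (lo, hi, bloom) in index.items():
        if not lo <= key <= hi:
            continue  # pruned by the zone map
        if use_bloom and not bloom.might_contain(key):
            continue  # pruned by the Bloom filter
        kept.append(path)
    return kept


# Hypothetical files with overlapping ranges; only part-1 holds key 60.
files = {"part-0.parquet": [1, 50, 99], "part-1.parquet": [40, 60, 80]}
index = build_index(files)
zonemap_only = files_to_read(index, 60, use_bloom=False)
with_bloom = files_to_read(index, 60)
print(zonemap_only, with_bloom)
```

The zone map alone cannot rule out a file whose range merely spans the key, while the Bloom filter can usually rule it out. Because Bloom filters allow false positives, a file that lacks the key may occasionally still be read, but a file that contains it is never skipped.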
Appreciate you keeping the community clean! Will be more careful with my posting workflow.
Anyone else accidentally post duplicates while juggling multiple optimization experiments?
09-02-2025 01:28 AM
@ck7007 no worries.
I asked a question on the other thread: https://community.databricks.com/t5/data-engineering/cost/td-p/130078. I'm not sure whether you're classing this thread or the other one as the duplicate, so I'll repost it here.
I didn't see you mention Liquid Clustering (https://docs.databricks.com/aws/en/delta/clustering). Was there a particular reason why? It was meant to replace Z-ORDER. If you did try it, I'd love to hear what impact it had for your use case.
All the best,
BS