Data Engineering

Advanced Technique

ck7007
New Contributor II

Reduced Monthly Databricks Bill from $47K to $12.7K

The Problem: We were scanning 2.3TB for queries needing only 8GB of data.

Three Quick Wins

1. Multi-dimensional Partitioning (30% savings)

# Before
df.write.partitionBy("date").parquet(path)

# After: partition by multiple columns
df.repartition("region", "date") \
    .sortWithinPartitions("customer_id") \
    .write.partitionBy("region", "date").parquet(path)
2. Add Zonemap Index (35% additional savings)

# Build the zonemap on high-cardinality columns only
selective_cols = [c for c in df.columns if df.select(c).distinct().count() > 100]
create_zonemap(table_path, selective_cols)
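(Note: create_zonemap is not a built-in Spark or Databricks API, it's our own helper. Here's a minimal sketch of the idea, assuming a Parquet table and per-file min/max stats persisted as JSON:)

import json
import os

from pyspark.sql import SparkSession, functions as F

def create_zonemap(table_path, columns):
    # Record per-file min/max "zones" for the given columns so a
    # reader can skip files whose value ranges can't match a filter.
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(table_path).withColumn("_file", F.input_file_name())
    stats = df.groupBy("_file").agg(
        *[F.min(c).alias(c + "_min") for c in columns],
        *[F.max(c).alias(c + "_max") for c in columns],
    )
    with open(os.path.join(table_path, "_zonemap.json"), "w") as f:
        json.dump([row.asDict() for row in stats.collect()], f, default=str)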

3. Query Rewriting (8% more savings)

Rewrite filters so they hit the partition columns directly, letting file pruning read only the necessary files.
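A minimal sketch of that rewrite, reusing the region/date partitioning from tip 1 (column names and values are illustrative):

from pyspark.sql import functions as F

# Before: filtering on a derived expression - every file gets scanned
slow = spark.read.parquet(path).filter(F.to_date("event_ts") == "2024-01-15")

# After: filtering on the partition columns - only the matching
# region=.../date=... directories are ever opened
fast = (spark.read.parquet(path)
        .filter((F.col("region") == "EU") & (F.col("date") == "2024-01-15")))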

Daily Cost Impact

  • Before: 847 DBU/day ($1,567)
  • After: 223 DBU/day ($423)
  • Monthly savings: $34,300

Key Learning: Z-ordering actually INCREASED our costs by 12%. Targeted zone maps worked better for our access patterns.
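(For reference, the Z-ordering we benchmarked against was the standard Delta OPTIMIZE; table and column names illustrative:)

spark.sql("OPTIMIZE events ZORDER BY (customer_id)")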

What's your biggest Databricks cost optimization win?

3 REPLIES

BS_THE_ANALYST
Honored Contributor III

Hi @ck7007 

Isn't this a repeat of your previous post? https://community.databricks.com/t5/data-engineering/cost/td-p/130078 

What's the rationale around the repost? 🙂

All the best,
BS

ck7007
New Contributor II

@BS Good catch, totally my mistake! 🤦 Had multiple drafts open and posted the wrong one. Thanks for the heads-up!

Just deleted the duplicate. What I meant to share was the Bloom filter follow-up that builds on that cost optimization:

Quick update: Adding Bloom filters to the zonemap strategy cut another $2.5K/month:

  • Zonemap alone: 73% file pruning
  • Zonemap + Bloom: 91% file pruning
  • Extra overhead: Only 35MB of memory

The combo especially helps with JOIN performance: we're seeing an 89% reduction in shuffled data.
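If you want to try the combo, the Bloom filter side is a one-off SQL index in Databricks (table, column, and options here are illustrative; tune fpp and numItems to your own data):

spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (customer_id OPTIONS (fpp = 0.1, numItems = 50000000))
""")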

Appreciate you keeping the community clean! Will be more careful with my posting workflow. 

Anyone else accidentally post duplicates while juggling multiple optimization experiments?

BS_THE_ANALYST
Honored Contributor III

@ck7007 no worries. 

I asked a question on the other thread (https://community.databricks.com/t5/data-engineering/cost/td-p/130078), but I'm not sure whether you're classing this thread or that one as the duplicate, so I'll repost it here.

I didn't see you mention anything around Liquid Clustering (https://docs.databricks.com/aws/en/delta/clustering). Was there a particular reason why? It was meant to replace Z-ORDER, so if you tried it, I'd love to hear what impact it had for your use case.
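For reference, it's declared once on the table rather than via repeated OPTIMIZE ZORDER runs (table and columns illustrative):

spark.sql("""
    CREATE TABLE events (event_ts TIMESTAMP, region STRING, customer_id BIGINT)
    CLUSTER BY (region, customer_id)
""")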

All the best,
BS