Cost

ck7007 — Thu, 28 Aug 2025 18:45:13 GMT

Reduced Monthly Databricks Bill from $47K to $12.7K

The Problem: We were scanning 2.3TB for queries needing only 8GB of data.

Three Quick Wins

1. Multi-dimensional Partitioning (30% savings)

# Before
df.write.partitionBy("date").parquet(path)

# After: partition by multiple columns
df.repartition("region", "date") \
    .sortWithinPartitions("customer_id") \
    .write.partitionBy("region", "date").parquet(path)
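
To illustrate why the extra partition column helps, here is a minimal pure-Python sketch of partition pruning over Hive-style directory names. The file listing and the helper names (partition_values, prune) are made up for illustration; Spark does this internally when a query filters on a partition column.

```python
# Hypothetical file listing for a table written with partitionBy("region", "date").
files = [
    "region=EU/date=2025-08-01/part-0000.parquet",
    "region=EU/date=2025-08-02/part-0000.parquet",
    "region=US/date=2025-08-01/part-0000.parquet",
    "region=US/date=2025-08-02/part-0000.parquet",
]

def partition_values(path):
    """Parse the 'key=value' directory segments of a path into a dict."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

def prune(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]

by_date = prune(files, date="2025-08-01")                    # 2 of 4 files
by_both = prune(files, region="EU", date="2025-08-01")       # 1 of 4 files
print(len(by_date), len(by_both))
```

A filter on date alone still has to touch every region directory, while a filter on both columns narrows the read to a single directory.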

2. Add Zonemap Index (35% additional savings)

# Build index on high-cardinality columns only
# (each distinct().count() below runs a separate Spark job)
selective_cols = [c for c in df.columns if df.select(c).distinct().count() > 100]
create_zonemap(table_path, selective_cols)

3. Query Rewriting (8% more savings)

Use file pruning to read only necessary files.
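
The post does not show how create_zonemap or the query rewriting is implemented, so the following is not that code. It is a rough pure-Python sketch of the general idea, assuming a zone map is simply a per-file (min, max) for a column; the file names, rows, and function names are all hypothetical.

```python
# Hypothetical rows per data file; a real zone map would be built from file statistics.
file_rows = {
    "part-0.parquet": [{"customer_id": 3}, {"customer_id": 17}],
    "part-1.parquet": [{"customer_id": 120}, {"customer_id": 240}],
    "part-2.parquet": [{"customer_id": 15}, {"customer_id": 130}],
}

def build_zonemap(rows_by_file, column):
    """Record the (min, max) of `column` for every file."""
    return {
        name: (min(r[column] for r in rows), max(r[column] for r in rows))
        for name, rows in rows_by_file.items()
    }

def files_for_range(zonemap, low, high):
    """Keep files whose [min, max] range overlaps the query range [low, high]."""
    return [name for name, (lo, hi) in zonemap.items() if lo <= high and hi >= low]

zonemap = build_zonemap(file_rows, "customer_id")
to_read = files_for_range(zonemap, 100, 125)
print(to_read)  # part-0 is skipped; part-1 and part-2 overlap the range
```

Note that part-2.parquet is kept only because its values span a wide range (15 to 130). Zone maps prune well when each file covers a narrow range of the column, which is presumably why sorting on customer_id before writing matters.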

Daily Cost Impact

Before: 847 DBU/day ($1,567)
After: 223 DBU/day ($423)
Monthly savings: $34,300
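
As a quick sanity check, the figures above can be cross-checked against the headline numbers. This uses only the values quoted in the post; the 30-day month is an assumption.

```python
before_daily, after_daily = 1567, 423            # USD/day, from the post
before_monthly, after_monthly = 47_000, 12_700   # USD/month, from the title

daily_savings = before_daily - after_daily       # 1144 USD/day
monthly_from_daily = daily_savings * 30          # assuming a 30-day month
monthly_from_bill = before_monthly - after_monthly
reduction_pct = 100 * monthly_from_bill / before_monthly

print(daily_savings, monthly_from_daily, monthly_from_bill, round(reduction_pct, 1))
```

The daily figures give about $34,320 per month, which lines up with the $34,300 difference between the two monthly bills, a reduction of roughly 73%. That also matches the sum of the three quoted savings (30% + 35% + 8%), if each is read as a share of the original bill.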

Key Learning: Z-ordering actually INCREASED our costs by 12%. Targeted zone maps worked better for our access patterns.

What's your biggest Databricks cost optimization win?

Re: Cost

BS_THE_ANALYST — Thu, 28 Aug 2025 18:57:42 GMT

@ck7007 thanks so much for sharing! That's such a saving, by the way. Congrats.

Out of curiosity, did you consider using Liquid Clustering, which is meant to replace partitioning and Z-ordering? https://docs.databricks.com/aws/en/delta/clustering

I found this part particularly interesting:

"It provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time."