cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Community Articles
Dive into a collaborative space where members like YOU can exchange knowledge, tips, and best practices. Join the conversation today and unlock a wealth of collective wisdom to enhance your experience and drive success.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How I Reduced Databricks Costs in a High-Volume Financial Data Platform

Nidhi_Patni
Databricks Partner

In financial services, data never sleeps. 

Trades flow in every second. Risk calculations refresh continuously. Regulatory reports demand precision. BI dashboards serve business users who expect sub-second responses. 

And behind all of that? 

A massive data platform processing billions of records daily. 

When I joined one such platform, we were ingesting and transforming: 

  • Billions of trade transactions 
  • Intraday position updates 
  • Risk and pricing feeds 
  • Regulatory snapshots 
  • Audit and reconciliation logs 

The lakehouse had grown rapidly โ€” but so had the Databricks bill. 

Pipelines were slow. Clusters were scaling unpredictably. Shuffle was out of control. Small files were everywhere. Driver OOM errors were becoming โ€œnormal.โ€ 

Instead of throwing bigger clusters at the problem, we decided to fix the fundamentals. 

Hereโ€™s how we approached performance optimization โ€” and what actually moved the needle. 

1.  Fixing the Biggest Silent Killer: Inefficient Writes 

Initially, incremental data was being appended blindly into Delta tables. 

This caused: 

  • Duplicate records 
  • Expensive reconciliation logic 
  • Full table scans during correction 
  • Growing storage + compute costs 

The Shift: Smart MERGE with Partition Pruning 

We: 

  • Partitioned large fact tables by trade_date 
  • Used MERGE INTO instead of append 
  • Included partition columns in the merge condition 

MERGE INTO trades t 

USING updates u 

ON t.trade_id = u.trade_id 

AND t.trade_date = u.trade_date 

WHEN MATCHED THEN UPDATE 

WHEN NOT MATCHED THEN INSERT 

Why this matters: 

  • Spark only scans relevant partitions 
  • No full-table rewrite 
  • No duplicates 

Result: 

  • Significant reduction in shuffle during merge 
  • Write time reduced

One lesson: Never run MERGE on a massive unpartitioned table. 

2. Designing Storage the Right Way (Partitioning + Z-Ordering) 

Queries were filtering by account_id and instrument_id, but the layout didnโ€™t support it. 

We redesigned the storage: 

  • Partitioned by trade_date (low cardinality, evenly distributed) 
  • Z-Ordered by high-cardinality integer columns like account_id 

OPTIMIZE trades ZORDER BY (account_id) 

Why integer? 
Better compression, faster sorting, more efficient file skipping than strings. 

We also ensured partition-based filtering before reading or merging โ€” reducing unnecessary scans. 

Result: 

  • Query latency drops
  • Fewer files scanned 
  • Significant reduction in I/O costs 

Good data layout is cheaper than bigger clusters. 

 3. Eliminating Recomputation with Cache & Persist 

The same transformed dataset was being written to: 

  • Delta table 
  • Regulatory export 
  • Audit store 

Each action triggered a full recomputation. 

We introduced: 

df.persist() 

df.count()

Then wrote to multiple targets. 

Result: 

  • Runtime reduction 
  • Significant compute savings 
  • More predictable job execution 

In distributed systems, recomputation is invisible โ€” but very expensive. 

 4. Replacing Python UDFs with Pandas UDFs 

Risk normalization logic was implemented using standard Python UDFs. 

Row-by-row execution was killing performance. 

We replaced it with vectorized Pandas UDFs. 

Result: 

  • 4x faster execution 
  • Better CPU utilization 
  • Reduced runtime by half. 

If youโ€™re still using standard Python UDFs for heavy transformations โ€” youโ€™re paying a hidden tax. 

 5. Broadcast Joins to Kill Shuffle 

Large shuffles were happening when joining fact tables with small dimension tables. 

Solution: 

df.join(broadcast(dim_df), "instrument_id") 

Result: 

  • Eliminated GBs of shuffle 
  • Join time reduced by 3X. 
  • Massive reduction in shuffle-related DBU consumption 

Shuffle is the most expensive operation in Spark. Avoid it when possible. 

 6. Enabling Photon + AQE 

We enabled: 

  • Photon engine 
  • Adaptive Query Execution (AQE) 
  • Auto shuffle partitions 

Photon accelerated: 

  • Aggregations 
  • Window functions 
  • Joins 

AQE: 

  • Automatically handled skew 
  • Adjusted shuffle partitions dynamically 

Result: 

  • 2โ€“3x faster SQL queries 
  • More stable job execution 
  • Reduced compute waste 

This was one of the highest ROI changes โ€” no code rewrite required. 

 7. Smarter Ingestion with Auto Loader 

Batch reads were reprocessing large datasets repeatedly. 

We replaced them with Auto Loader: 

  • Incremental ingestion 
  • Schema evolution support 
  • Efficient file discovery 

Result: 

  • Ingestion cost reduction 
  • No more full dataset rescans 
  • Better reliability for streaming feeds 

In finance, incremental processing isnโ€™t optional โ€” itโ€™s essential. 

 8. Cluster Strategy Based on Workload Behavior 

Instead of blindly scaling up clusters, we studied Spark UI: 

  • High I/O wait? 
  • Spill to disk? 
  • Skewed tasks? 
  • Driver memory pressure? 

Based on workload: 

  • Used NVMe-backed instances when spill occurred 
  • Enabled Databricks I/O cache for repeated queries 
  • Right-sized drivers (2x memory when unavoidable collect operations were needed) 
  • Used instance pools to reduce startup overhead 

Result: 

  • Reduced cluster startup time from 6 minutes to 1.5 minutes 
  • Improved stability 
  • Lower idle DBU consumption 

Scaling without diagnosis is expensive. 

9. Distributed Execution Instead of Driver Bottlenecks 

We replaced: 

  • ThreadPoolExecutor 
  • Heavy collect() usage 

With: 

  • foreachPartition 
  • toLocalIterator() when driver-side iteration was unavoidable 

Result: 

  • Eliminated driver OOM errors 
  • Improved external API integration throughput 
  • Stabilized reconciliation jobs 

In Spark, the driver is not your compute engine. Executors are. 

10. Liquid Clustering & Predictive Optimization 

For evolving medium-sized datasets, we enabled: 

  • Liquid clustering (instead of rigid partitioning) 
  • Predictive optimization for automatic maintenance 

Result: 

  • Reduced manual OPTIMIZE/VACUUM overhead 
  • More consistent query performance 
  • Lower maintenance effort 

Sometimes the best optimization is letting the engine optimize itself. 

2 REPLIES 2

wesleyfelipe
Contributor

Hi @Nidhi_Patni !

This is a very insightful post and has lots of good ideas! Congrats on the results!

 

I have one question: what was the trade-off between using traditional partitioning with Z-ordering versus liquid clustering?

Kirankumarbs
Contributor III

@Nidhi_Patni Thanks for production level information!

@wesleyfelipe 
I have one question: what was the trade-off between using traditional partitioning with Z-ordering versus liquid clustering?

Traditional partitioning with Z-ordering is kind of the old-school approach. Most people still stick with it because it works well for them (for example, most of my tables are still using it). We only started using liquid clustering about a month ago.

Also, keep in mind that liquid clustering requires DBR 13.3 or above. If youโ€™re using structured streaming, youโ€™ll need 15.4+. For us, upgrading to 15.4+ took a bit of time.

If you want to deep dive here you go