In financial services, data never sleeps.
Trades flow in every second. Risk calculations refresh continuously. Regulatory reports demand precision. BI dashboards serve business users who expect sub-second responses.
And behind all of that?
A massive data platform processing billions of records daily.
When I joined one such platform, we were ingesting and transforming:
- Billions of trade transactions
- Intraday position updates
- Audit and reconciliation logs
The lakehouse had grown rapidly โ but so had the Databricks bill.
Pipelines were slow. Clusters were scaling unpredictably. Shuffle was out of control. Small files were everywhere. Driver OOM errors were becoming โnormal.โ
Instead of throwing bigger clusters at the problem, we decided to fix the fundamentals.
Hereโs how we approached performance optimization โ and what actually moved the needle.
1. Fixing the Biggest Silent Killer: Inefficient Writes
Initially, incremental data was being appended blindly into Delta tables.
This caused:
- Expensive reconciliation logic
- Full table scans during correction
- Growing storage + compute costs
The Shift: Smart MERGE with Partition Pruning
We:
- Partitioned large fact tables by trade_date
- Used MERGE INTO instead of append
- Included partition columns in the merge condition
MERGE INTO trades t
USING updates u
ON t.trade_id = u.trade_id
AND t.trade_date = u.trade_date
WHEN MATCHED THEN UPDATE
WHEN NOT MATCHED THEN INSERT
Why this matters:
- Spark only scans relevant partitions
Result:
- Significant reduction in shuffle during merge
One lesson: Never run MERGE on a massive unpartitioned table.
2. Designing Storage the Right Way (Partitioning + Z-Ordering)
Queries were filtering by account_id and instrument_id, but the layout didnโt support it.
We redesigned the storage:
- Partitioned by trade_date (low cardinality, evenly distributed)
- Z-Ordered by high-cardinality integer columns like account_id
OPTIMIZE trades ZORDER BY (account_id)
Why integer?
Better compression, faster sorting, more efficient file skipping than strings.
We also ensured partition-based filtering before reading or merging โ reducing unnecessary scans.
Result:
- Significant reduction in I/O costs
Good data layout is cheaper than bigger clusters.
3. Eliminating Recomputation with Cache & Persist
The same transformed dataset was being written to:
Each action triggered a full recomputation.
We introduced:
df.persist()
df.count()
Then wrote to multiple targets.
Result:
- Significant compute savings
- More predictable job execution
In distributed systems, recomputation is invisible โ but very expensive.
4. Replacing Python UDFs with Pandas UDFs
Risk normalization logic was implemented using standard Python UDFs.
Row-by-row execution was killing performance.
We replaced it with vectorized Pandas UDFs.
Result:
If youโre still using standard Python UDFs for heavy transformations โ youโre paying a hidden tax.
5. Broadcast Joins to Kill Shuffle
Large shuffles were happening when joining fact tables with small dimension tables.
Solution:
df.join(broadcast(dim_df), "instrument_id")
Result:
- Eliminated GBs of shuffle
- Massive reduction in shuffle-related DBU consumption
Shuffle is the most expensive operation in Spark. Avoid it when possible.
6. Enabling Photon + AQE
We enabled:
- Adaptive Query Execution (AQE)
Photon accelerated:
AQE:
- Automatically handled skew
- Adjusted shuffle partitions dynamically
Result:
- 2โ3x faster SQL queries
- More stable job execution
This was one of the highest ROI changes โ no code rewrite required.
7. Smarter Ingestion with Auto Loader
Batch reads were reprocessing large datasets repeatedly.
We replaced them with Auto Loader:
Result:
- No more full dataset rescans
- Better reliability for streaming feeds
In finance, incremental processing isnโt optional โ itโs essential.
8. Cluster Strategy Based on Workload Behavior
Instead of blindly scaling up clusters, we studied Spark UI:
Based on workload:
- Used NVMe-backed instances when spill occurred
- Enabled Databricks I/O cache for repeated queries
- Right-sized drivers (2x memory when unavoidable collect operations were needed)
- Used instance pools to reduce startup overhead
Result:
- Reduced cluster startup time from 6 minutes to 1.5 minutes
- Lower idle DBU consumption
Scaling without diagnosis is expensive.
9. Distributed Execution Instead of Driver Bottlenecks
We replaced:
With:
- toLocalIterator() when driver-side iteration was unavoidable
Result:
- Eliminated driver OOM errors
- Improved external API integration throughput
- Stabilized reconciliation jobs
In Spark, the driver is not your compute engine. Executors are.
10. Liquid Clustering & Predictive Optimization
For evolving medium-sized datasets, we enabled:
- Liquid clustering (instead of rigid partitioning)
- Predictive optimization for automatic maintenance
Result:
- Reduced manual OPTIMIZE/VACUUM overhead
- More consistent query performance
Sometimes the best optimization is letting the engine optimize itself.