Databricks Community

Nidhi_Patni · ‎02-15-2026

In financial services, data never sleeps.

Trades flow in every second. Risk calculations refresh continuously. Regulatory reports demand precision. BI dashboards serve business users who expect sub-second responses.

And behind all of that?

A massive data platform processing billions of records daily.

When I joined one such platform, we were ingesting and transforming:

Billions of trade transactions

Intraday position updates

Risk and pricing feeds

Regulatory snapshots

Audit and reconciliation logs

The lakehouse had grown rapidly — but so had the Databricks bill.

Pipelines were slow. Clusters were scaling unpredictably. Shuffle was out of control. Small files were everywhere. Driver OOM errors were becoming “normal.”

Instead of throwing bigger clusters at the problem, we decided to fix the fundamentals.

Here’s how we approached performance optimization — and what actually moved the needle.

1. Fixing the Biggest Silent Killer: Inefficient Writes

Initially, incremental data was being appended blindly into Delta tables.

This caused:

Duplicate records

Expensive reconciliation logic

Full table scans during correction

Growing storage + compute costs

The Shift: Smart MERGE with Partition Pruning

We:

Partitioned large fact tables by trade_date

Used MERGE INTO instead of append

Included partition columns in the merge condition

MERGE INTO trades t

USING updates u

ON t.trade_id = u.trade_id

AND t.trade_date = u.trade_date

WHEN MATCHED THEN UPDATE

WHEN NOT MATCHED THEN INSERT

Why this matters:

Spark only scans relevant partitions

No full-table rewrite

No duplicates

Result:

Significant reduction in shuffle during merge

Write time reduced

One lesson: Never run MERGE on a massive unpartitioned table.

2. Designing Storage the Right Way (Partitioning + Z-Ordering)

Queries were filtering by account_id and instrument_id, but the layout didn’t support it.

We redesigned the storage:

Partitioned by trade_date (low cardinality, evenly distributed)

Z-Ordered by high-cardinality integer columns like account_id

OPTIMIZE trades ZORDER BY (account_id)

Why integer?
Better compression, faster sorting, more efficient file skipping than strings.

We also ensured partition-based filtering before reading or merging — reducing unnecessary scans.

Result:

Query latency drops

Fewer files scanned

Significant reduction in I/O costs

Good data layout is cheaper than bigger clusters.

3. Eliminating Recomputation with Cache & Persist

The same transformed dataset was being written to:

Delta table

Regulatory export

Audit store

Each action triggered a full recomputation.

We introduced:

df.persist()

df.count()

Then wrote to multiple targets.

Result:

Runtime reduction

Significant compute savings

More predictable job execution

In distributed systems, recomputation is invisible — but very expensive.

4. Replacing Python UDFs with Pandas UDFs

Risk normalization logic was implemented using standard Python UDFs.

Row-by-row execution was killing performance.

We replaced it with vectorized Pandas UDFs.

Result:

4x faster execution

Better CPU utilization

Reduced runtime by half.

If you’re still using standard Python UDFs for heavy transformations — you’re paying a hidden tax.

5. Broadcast Joins to Kill Shuffle

Large shuffles were happening when joining fact tables with small dimension tables.

Solution:

df.join(broadcast(dim_df), "instrument_id")

Result:

Eliminated GBs of shuffle

Join time reduced by 3X.

Massive reduction in shuffle-related DBU consumption

Shuffle is the most expensive operation in Spark. Avoid it when possible.

6. Enabling Photon + AQE

We enabled:

Photon engine

Adaptive Query Execution (AQE)

Auto shuffle partitions

Photon accelerated:

Aggregations

Window functions

Joins

AQE:

Automatically handled skew

Adjusted shuffle partitions dynamically

Result:

2–3x faster SQL queries

More stable job execution

Reduced compute waste

This was one of the highest ROI changes — no code rewrite required.

7. Smarter Ingestion with Auto Loader

Batch reads were reprocessing large datasets repeatedly.

We replaced them with Auto Loader:

Incremental ingestion

Schema evolution support

Efficient file discovery

Result:

Ingestion cost reduction

No more full dataset rescans

Better reliability for streaming feeds

In finance, incremental processing isn’t optional — it’s essential.

8. Cluster Strategy Based on Workload Behavior

Instead of blindly scaling up clusters, we studied Spark UI:

High I/O wait?

Spill to disk?

Skewed tasks?

Driver memory pressure?

Based on workload:

Used NVMe-backed instances when spill occurred

Enabled Databricks I/O cache for repeated queries

Right-sized drivers (2x memory when unavoidable collect operations were needed)

Used instance pools to reduce startup overhead

Result:

Reduced cluster startup time from 6 minutes to 1.5 minutes

Improved stability

Lower idle DBU consumption

Scaling without diagnosis is expensive.

9. Distributed Execution Instead of Driver Bottlenecks

We replaced:

ThreadPoolExecutor

Heavy collect() usage

With:

foreachPartition

toLocalIterator() when driver-side iteration was unavoidable

Result:

Eliminated driver OOM errors

Improved external API integration throughput

Stabilized reconciliation jobs

In Spark, the driver is not your compute engine. Executors are.

10. Liquid Clustering & Predictive Optimization

For evolving medium-sized datasets, we enabled:

Liquid clustering (instead of rigid partitioning)

Predictive optimization for automatic maintenance

Result:

Reduced manual OPTIMIZE/VACUUM overhead

More consistent query performance

Lower maintenance effort

Sometimes the best optimization is letting the engine optimize itself.

wesleyfelipe · ‎02-16-2026

Hi @Nidhi_Patni !

This is a very insightful post and has lots of good ideas! Congrats on the results!

I have one question: what was the trade-off between using traditional partitioning with Z-ordering versus liquid clustering?

Kirankumarbs · ‎02-16-2026

@Nidhi_Patni Thanks for production level information!

@wesleyfelipe
> I have one question: what was the trade-off between using traditional partitioning with Z-ordering versus liquid clustering?

Traditional partitioning with Z-ordering is kind of the old-school approach. Most people still stick with it because it works well for them (for example, most of my tables are still using it). We only started using liquid clustering about a month ago.

Also, keep in mind that liquid clustering requires DBR 13.3 or above. If you’re using structured streaming, you’ll need 15.4+. For us, upgrading to 15.4+ took a bit of time.

If you want to deep dive here you go