
Optimizing Delta Table Writes for Massive Datasets in Databricks

kanikvijay9
New Contributor III

Problem Statement

In one of my recent projects, I faced a significant challenge: writing a huge dataset of 11,582,763,212 rows and 2,068 columns to a Databricks managed Delta table.

The initial write operation took 22.4 hours using the following setup:

  • Cluster Configuration:
    • Driver: Standard_E4ads_v5 (4 cores, 32 GB)
    • Workers: Standard_E4ads_v5 (4 cores, 32 GB), 2–10 autoscaling
  • Databricks Runtime: 15.4.28
  • Spark Configurations:

[Screenshot: Spark configuration used for the initial write (kanikvijay9_0-1762695454233.png)]
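
In place of the screenshot, here is a minimal sketch of the original session setup; only spark.sql.shuffle.partitions=16 is confirmed by the post, while df and table_name are placeholder names.

  # Reconstructed for illustration; only spark.sql.shuffle.partitions=16 is
  # confirmed by the post. "df" stands in for the 11.58B-row source DataFrame
  # and "table_name" is the same placeholder used later in the post.
  spark.conf.set("spark.sql.shuffle.partitions", "16")

  df.write.format("delta").mode("overwrite").saveAsTable("table_name")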

 


Why Was It So Slow?

  • Low Parallelism: With spark.sql.shuffle.partitions=16, each shuffle partition handled ~724M of the 11.58B rows.
  • Underpowered Cluster: Even at 10 workers, the cluster offered only 40 cores for 11.5B rows and 2,068 columns.
  • Wide Rows: 2,068 columns caused huge shuffle size and memory pressure.
  • Delta Overhead: Auto-compaction during write added extra steps.

Optimization Strategy

1. Increase Shuffle Partitions

Reason: More partitions → smaller chunks → better parallelism → less skew.

[Screenshot: increased shuffle partition setting (kanikvijay9_1-1762695506126.png)]
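
A minimal sketch of the change; the exact value from the screenshot isn't reproduced here, so 2000 is an illustrative assumption sized to bring each shuffle partition down to a few million rows instead of ~724M.

  # Raise shuffle parallelism before triggering the write. 2000 is an
  # illustrative value; the figure actually used is in the screenshot above.
  spark.conf.set("spark.sql.shuffle.partitions", "2000")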

 

2. Partition the Delta Table

Reason: Reduces file size per partition and improves query performance.

[Screenshot: partitioned Delta table write (kanikvijay9_2-1762695536800.png)]
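
A sketch of the partitioned write; "region" matches the region-based partitioning mentioned in the replies below, and df/table_name remain placeholders.

  # Write the managed Delta table partitioned by "region", the partitioning
  # choice described later in the thread.
  (
      df.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("region")
        .saveAsTable("table_name")
  )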

3. Adjust Cluster Configuration

Reason: Handles massive shuffle and sort for wide rows.

  • Recommended: 8–12 workers of Standard_E8ads_v5 (8 cores, 64 GB each)
  • Total: 64–96 cores, 512–768 GB memory

4. Disable Auto-Compact During Initial Load

Reason: Avoids extra compaction steps during heavy write.

[Screenshot: disabling auto-compaction for the initial load (kanikvijay9_3-1762695573841.png)]
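
A sketch of the session-level switch, assuming the standard Delta auto-compaction setting; re-enable it, or rely on an explicit OPTIMIZE, once the load completes.

  # Turn off Delta auto-compaction for the duration of the initial bulk load.
  spark.conf.set("spark.databricks.delta.autoCompact.enabled", "false")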

5. Post-Write Optimization Workflow

Reason: Compacts small files and improves query performance.

  1. Write Data: Focus on efficient partitioning and parallelism
  2. Optimize: spark.sql("OPTIMIZE table_name ZORDER BY (important_columns)")
  3. Vacuum: spark.sql("VACUUM table_name RETAIN 168 HOURS")

Why This Order?

Combining write, optimize, and vacuum in one job creates a huge DAG with multiple shuffles and risks OOM. Splitting them into separate jobs, as sketched after the list below, keeps each stage manageable:

  • Write → efficient distribution
  • Optimize → file compaction
  • Vacuum → cleanup
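
One way to realize this split, sketched as three separate jobs using the placeholder table, column, and partition names from the steps above:

  # Job 1 - write: distribute the data efficiently.
  df.write.format("delta").mode("overwrite").partitionBy("region").saveAsTable("table_name")

  # Job 2 - optimize: compact small files (run only after the write succeeds).
  spark.sql("OPTIMIZE table_name ZORDER BY (important_columns)")

  # Job 3 - vacuum: remove stale files (run only after OPTIMIZE succeeds).
  spark.sql("VACUUM table_name RETAIN 168 HOURS")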

Expected Impact

  • Original runtime: 35+ hours
  • After optimization: 10–12 hours (with better cluster and configs)
  • For 11B rows × 2,068 columns: with chunked writes and an upgraded cluster → 8–12 hours instead of days

Key Takeaways

  • Parallelism and partitioning are critical for large-scale writes.
  • Cluster sizing matters more than you think.
  • Separate write, optimize, and vacuum for better performance and smaller DAGs.
  • Disable auto-compaction during initial load and run OPTIMIZE later.
2 REPLIES

Louis_Frolio
Databricks Employee

@kanikvijay9 , Really great post. Dropping runtime from 22.4 hours to 8–12 is no small feat — that’s some serious optimization work. A few thoughts that might take it even further:

Let’s start with Adaptive Query Execution (AQE). If it’s not already in play, definitely give it a look. AQE can dynamically fine-tune shuffle partitions at runtime using actual data stats, which often saves a ton of manual trial and error.
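
For reference, a sketch of the relevant session settings; AQE is enabled by default on recent runtimes, so this is shown only to name the knobs involved.

  # AQE is controlled by standard Spark settings and is on by default in
  # recent Databricks Runtime versions.
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")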

Then there’s Column Pruning. With over two thousand columns, it’s worth analyzing which sets are most frequently queried together. If patterns emerge, you might consider splitting into a few narrower tables. That can make queries more efficient and easier to manage.

And for Databricks Runtime 13.3+, Liquid Clustering is a game-changer. It handles high-cardinality columns gracefully and removes the need for manual ZORDERing — one less maintenance headache to worry about.
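
A rough sketch of what that migration could look like; table and column names are placeholders, and enabling clustering on an existing table has runtime-version and table requirements worth checking first.

  # Enable liquid clustering on an existing Delta table, then recluster.
  # "table_name" and "high_cardinality_col" are placeholders.
  spark.sql("ALTER TABLE table_name CLUSTER BY (high_cardinality_col)")
  spark.sql("OPTIMIZE table_name")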

Out of curiosity, which column(s) did you land on for partitioning the Delta table? That choice alone can make or break both write throughput and read performance.

Cheers, Louis.

kanikvijay9
New Contributor III

Hey @Louis_Frolio ,

Thank you for the thoughtful feedback and great suggestions!

A few clarifications:

  • AQE is already enabled in my setup, and it definitely helped reduce shuffle overhead during the write.
  • Regarding Column Pruning, in this case, the final output requires all 2,068 columns to be written to the managed table, so splitting into narrower tables isn’t an option for this workload.
  • I completely agree on Liquid Clustering—with such high cardinality and large data volumes, it’s a strong candidate for future optimization. Removing manual ZORDER maintenance would be a big win.

As for your question on partitioning: I used region-based partitioning since it aligns well with query patterns and helps balance file sizes across partitions.

Appreciate your insights—these are excellent considerations for anyone tackling similar large-scale Delta writes!

Cheers,
Kanik