Hi Community,
I'm working on a large-scale data processing job in Databricks and running into performance and stability issues during the write operations. Here's a detailed breakdown of my use case and environment:
Use Case Overview:
Primary Data Frames:
- First DataFrame: created by reading from 4 external tables (each ~7B rows, 7–15 columns). Final DF size: 3.5B+ rows, mostly strings, nulls, and Booleans.
- Second DataFrame: created from a managed table in the Databricks catalog. Size: 3.8B+ rows, 10–15 columns. (Both are sketched just below.)
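To make that concrete, here's a minimal sketch of how the two DataFrames are sourced; the catalog/table names, join keys, and join logic are placeholders rather than my exact code:

```python
# `spark` is the ambient SparkSession in a Databricks notebook.
# Placeholder names: ext_catalog.raw.source_1..4 and main.curated.managed_table.
ext_tables = [f"ext_catalog.raw.source_{i}" for i in range(1, 5)]

# First DataFrame: assembled from the 4 external tables (~7B rows each).
df_first = spark.read.table(ext_tables[0])
for t in ext_tables[1:]:
    df_first = df_first.join(spark.read.table(t), on=["key1", "key2"], how="left")

# Second DataFrame: a managed Unity Catalog table (3.8B+ rows).
df_second = spark.read.table("main.curated.managed_table")
```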
Processing Steps:
- Writing both primary DataFrames to temporary managed tables.
- Performing multi-key joins using Spark SQL.
- Final join with another external table (~5B rows, 297.8 GiB across 3,444 files).
- Final output written to a managed Delta table using Liquid Clustering. (The full pipeline is sketched after this list.)
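End-to-end, the steps look roughly like this (table names, keys, and selected columns are illustrative):

```python
# Step 1: materialize both primary DataFrames as temporary managed tables.
df_first.write.mode("overwrite").saveAsTable("main.tmp.first_stage")
df_second.write.mode("overwrite").saveAsTable("main.tmp.second_stage")

# Step 2: multi-key join in Spark SQL.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW joined_stage AS
    SELECT a.*, b.col_x
    FROM main.tmp.first_stage a
    JOIN main.tmp.second_stage b
      ON a.key1 = b.key1 AND a.key2 = b.key2 AND a.key3 = b.key3
""")

# Step 3: final join with the ~5B-row external table.
final_df = spark.sql("""
    SELECT j.*, e.col_y
    FROM joined_stage j
    JOIN ext_catalog.raw.big_external e
      ON j.key1 = e.key1
""")
```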
Cluster Configuration:
- Databricks Runtime: 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12)
- Driver: Standard_F32s_v4 (160 GB RAM, 32 Cores)
- Worker: Standard_E32d_v4 (160 GB RAM, 32 Cores)
- Min Workers: 2 | Max Workers: 18 | Current Workers: 6
- Unity Catalog Enabled
Spark Configurations:
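For context, the session-level tuning in place is along these lines (the keys are standard Spark/Databricks configs; the exact values on my cluster may differ):

```python
# Illustrative session-level settings, not my exact values.
spark.conf.set("spark.sql.adaptive.enabled", "true")    # AQE (on by default in recent DBR)
spark.conf.set("spark.sql.shuffle.partitions", "auto")  # let AQE size shuffle partitions
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # coalesce small files on write
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")    # compact small files after write
```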

- Attempted but Failed (Memory Error):
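This attempt raised executor heap and off-heap memory in the cluster's Spark config, roughly as below (values back-calculated from the error message that follows: 100g of heap is 102400 MB, and 102400 + 92475 = 194875 MB, well over the 123300 MB cap for the node type):

```
spark.executor.memory 100g
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 92475m
```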

- Error Message:
  com.databricks.backend.cluster.IllegalSparkContainerMemoryException: INVALID_PARAMETER_VALUE: Specified heap memory (102400 MB) and off heap memory (92475 MB) is above the maximum executor memory (123300 MB) allowed for node type Standard_E20d_v4.
Write Logic:
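The final write uses Liquid Clustering via the DataFrameWriterV2 clusterBy API (available on recent DBR); the target table and clustering columns below are placeholders:

```python
# Write the final output as a managed Delta table with Liquid Clustering.
(final_df.writeTo("main.gold.final_output")
    .using("delta")
    .clusterBy("key1", "key2")   # placeholder clustering columns
    .createOrReplace())
```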

What I’ve Tried:
- Caching/Persisting: Attempted to cache/persist intermediate DataFrames, but at this scale it wasn't feasible and led to memory pressure.
- Partitioning during write: Used partitionBy() on multiple columns to optimize write performance, but I'm still seeing executor failures and long write times. (Both attempts are sketched after this list.)
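For reference, rough sketches of both attempts (the storage level and partition columns are examples, not my exact choices):

```python
from pyspark import StorageLevel

# Caching attempt: spills heavily and creates memory pressure at this scale.
df_first.persist(StorageLevel.MEMORY_AND_DISK)

# Partitioned-write attempt: still slow, with executor failures.
(df_first.write
    .mode("overwrite")
    .partitionBy("col_a", "col_b")   # example partition columns
    .saveAsTable("main.tmp.first_stage"))
```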
Issues Faced:
- Executor Failures: While writing the first primary DataFrame to a temporary managed table, 60%+ of executors fail.
- Write Time: Writing the intermediate DataFrames is extremely slow and unstable (8–10 hours each).
- Final Join: Joining with the ~5B-row external table compounds the performance issues.
Looking for Help On:
- Best practices for writing large DataFrames efficiently.
- Optimizing memory and executor configurations for large-scale joins and writes.
- Suggestions for improving write performance with Delta + Liquid Clustering.
- Any known limitations or tuning tips for Databricks Runtime 16.4 LTS with Spark 3.5.2.
Any insights, suggestions, or shared experiences would be greatly appreciated!
Thanks in advance
Kanik Vijay