Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows)

kanikvijay9 — Tue, 12 Aug 2025 16:30:43 GMT

Hi Community,

I'm working on a large-scale data processing job in Databricks and facing performance and stability issues during the write operations. Here's a detailed breakdown of my use case and environment:

Use Case Overview:

Primary Data Frames:

First Data Frame: Created by reading from 4 external tables (each ~7B rows, 7–15 columns). Final DF size: 3.5B+ rows, mostly strings, nulls, and Booleans.
Second Data Frame: Created from a managed table in Databricks catalog. Size: 3.8B+ rows, 10–15 columns.
Processing Steps:
- Writing both primary Data Frames to temporary managed tables.
- Performing multi-key joins using Spark SQL.
- Final join with another external table (~5B rows, 297.8GiB across 3444 files).
- Final output is written to a managed Delta table using Liquid Clustering.

Cluster Configuration:

Databricks Runtime: 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12)
Driver: Standard_F32s_v4 (160 GB RAM, 32 Cores)
Worker: Standard_E32d_v4 (160 GB RAM, 32 Cores)
Min Workers: 2 | Max Workers: 18 | Current Workers: 6
Unity Catalog Enabled

Spark Configurations:

Modified Successfully:

Attempted but Failed (Memory Error):

Error Message:
- com.databricks.backend.cluster.IllegalSparkContainerMemoryException: INVALID_PARAMETER_VALUE: Specified heap memory (102400 MB) and off heap memory (92475 MB) is above the maximum executor memory (123300 MB) allowed for node type Standard_E20d_v4.

Write Logic:

What I’ve Tried:

Caching/Persisting: Attempted to cache/persist intermediate Data Frames, but due to the massive size, it was not feasible and led to memory pressure.
Partitioning During Write: Used partition By on multiple columns to optimize write performance, but still facing executor failures and long write times.

Issues Faced:

Executor Failures: While writing the first primary Data Frame to temporary managed tables, 60%+ executors fail.
Write Time: Writing intermediate Data Frames (each taking 8–10 hours) is extremely slow and unstable.
Final Join: Joining with the 5B-row external table is compounding the performance issues.

Looking for Help On:

Best practices for writing large Data Frames efficiently.
Optimizing memory and executor configurations for large-scale joins and writes.
Suggestions for improving write performance with Delta + Liquid Clustering.
Any known limitations or tuning tips for Databricks Runtime 10.4 LTS with Spark 3.2.2.

Any insights, suggestions, or shared experiences would be greatly appreciated!

Thanks in advance
Kanik Vijay

Re: Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows)

szymon_dybczak — Tue, 19 Aug 2025 09:31:42 GMT

Hi @kanikvijay9 ,

Thanks for sharing solution with us! Could you mark your answer as a solution to the thread? It helps other memebers find correct answer in a faster way 🙂

topic Re: Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows) in Data Engineering

Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows)

Use Case Overview:

Primary Data Frames:

Write Logic:

What I’ve Tried:

Re: Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows)

Re: Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows)