Performance Issues with Writing Large DataFrames to Managed Tables in Databricks (3.5B+ Rows)

kanikvijay9
New Contributor III

Hi Community,

I'm working on a large-scale data processing job in Databricks and facing performance and stability issues during the write operations. Here's a detailed breakdown of my use case and environment:

Use Case Overview:

Primary DataFrames:

  1. First DataFrame: Created by reading from 4 external tables (each ~7B rows, 7–15 columns). Final DF size: 3.5B+ rows, mostly strings, nulls, and Booleans.
  2. Second DataFrame: Created from a managed table in the Databricks catalog. Size: 3.8B+ rows, 10–15 columns.

Processing Steps:

  • Write both primary DataFrames to temporary managed tables.
  • Perform multi-key joins using Spark SQL (a simplified sketch follows this list).
  • Final join with another external table (~5B rows, 297.8 GiB across 3444 files).
  • Write the final output to a managed Delta table using Liquid Clustering.
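
For reference, here is a minimal PySpark sketch of this staging-and-join flow. Every catalog, schema, table, and column name below is a placeholder, not the actual object from my job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First primary DataFrame (~3.5B rows), assembled from the 4 external tables.
df1 = spark.read.table("ext_catalog.src.combined_source")

# Second primary DataFrame (~3.8B rows), read from a managed table.
df2 = spark.read.table("main.silver.managed_source")

# Step 1: stage both DataFrames as temporary managed tables.
df1.write.mode("overwrite").saveAsTable("main.tmp.stage_one")
df2.write.mode("overwrite").saveAsTable("main.tmp.stage_two")

# Step 2: multi-key join in Spark SQL (join keys are placeholders).
joined = spark.sql("""
    SELECT s1.*, s2.some_col
    FROM main.tmp.stage_one AS s1
    JOIN main.tmp.stage_two AS s2
      ON s1.key1 = s2.key1
     AND s1.key2 = s2.key2
""")

# Step 3: the final join with the ~5B-row external table and the clustered
# write are shown under "Write Logic" below.
joined.createOrReplaceTempView("joined_stage")
```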

Cluster Configuration:

  • Databricks Runtime: 16.4 LTS (includes Apache Spark 3.5.2, Scala 2.12)
  • Driver: Standard_F32s_v4 (160 GB RAM, 32 Cores)
  • Worker: Standard_E32d_v4 (160 GB RAM, 32 Cores)
  • Min Workers: 2 | Max Workers: 18 | Current Workers: 6
  • Unity Catalog Enabled

Spark Configurations:

  • Modified Successfully:
    [screenshot of the successfully applied Spark configurations]

 

  •  Attempted but Failed (Memory Error):
    [screenshot of the Spark configuration attempt that failed with a memory error]

 

  • Error Message:
    • com.databricks.backend.cluster.IllegalSparkContainerMemoryException: INVALID_PARAMETER_VALUE: Specified heap memory (102400 MB) and off heap memory (92475 MB) is above the maximum executor memory (123300 MB) allowed for node type Standard_E20d_v4.
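
The numbers in the message show the problem directly: the requested heap plus off-heap exceeds the per-node executor cap, so any split has to come in under 123,300 MB for this node type. A quick check with the values from the error:

```python
# Values taken from the error message above.
heap_mb = 102_400          # requested executor heap (spark.executor.memory)
off_heap_mb = 92_475       # requested off-heap size (spark.memory.offHeap.size)
max_executor_mb = 123_300  # cap reported for the node type

total_mb = heap_mb + off_heap_mb
print(total_mb, total_mb <= max_executor_mb)   # 194875 False -> config rejected
```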

 

Write Logic:

[screenshot of the write logic]
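
In case the screenshot is not visible, here is a simplified sketch of this kind of clustered write, reusing the joined_stage view from the earlier sketch (all names are placeholders):

```python
# Final join with the ~5B-row external table, written to a managed Delta table
# with Liquid Clustering (CLUSTER BY defines the clustering keys).
spark.sql("""
    CREATE OR REPLACE TABLE main.gold.final_output
    CLUSTER BY (key1, key2)
    AS
    SELECT j.*, e.extra_col
    FROM joined_stage AS j
    JOIN ext_catalog.src.big_external AS e
      ON j.key1 = e.key1
""")

# Liquid Clustering re-clusters incrementally; run OPTIMIZE after large writes.
spark.sql("OPTIMIZE main.gold.final_output")
```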

 

What I’ve Tried:

  1. Caching/Persisting: Attempted to cache/persist intermediate DataFrames, but at this scale it was not feasible and led to memory pressure.
  2. Partitioning During Write: Used partitionBy on multiple columns to speed up the write, but still hit executor failures and long write times (both attempts are sketched below).
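
Roughly, these two attempts looked like the following (same placeholder names as in the sketch above):

```python
from pyspark import StorageLevel

# Attempt 1: persist an intermediate DataFrame. At 3.5B+ rows this spilled
# heavily and created memory pressure instead of helping.
df1.persist(StorageLevel.MEMORY_AND_DISK)

# Attempt 2: partition the staging write on a few columns (placeholder names).
(
    df1.write
       .mode("overwrite")
       .partitionBy("region", "event_date")
       .saveAsTable("main.tmp.stage_one_partitioned")
)
```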

Issues Faced:

  1. Executor Failures: While writing the first primary DataFrame to a temporary managed table, 60%+ of the executors fail.
  2. Write Time: Writing the intermediate DataFrames is extremely slow and unstable (8–10 hours each).
  3. Final Join: Joining with the ~5B-row external table compounds the performance issues.

Looking for Help On:

  • Best practices for writing large DataFrames efficiently.
  • Optimizing memory and executor configurations for large-scale joins and writes.
  • Suggestions for improving write performance with Delta + Liquid Clustering.
  • Any known limitations or tuning tips for Databricks Runtime 16.4 LTS with Spark 3.5.2.

Any insights, suggestions, or shared experiences would be greatly appreciated!

Thanks in advance,
Kanik Vijay


szymon_dybczak
Esteemed Contributor III

Hi @kanikvijay9 ,

Thanks for sharing the solution with us! Could you mark your answer as the accepted solution for the thread? It helps other members find the correct answer faster 🙂