
Troubleshooting Spill

lawrence009
Contributor

I am trying to troubleshoot why spill occurred during DeltaOptimizeWrite. I am running a 64-core cluster with 256 GB RAM, which I would expect to handle this amount of data (see attached DAG).

[Attachment: IMG_1085.jpeg (DAG screenshot)]
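For context, the DeltaOptimizeWrite stage in the DAG comes from Databricks optimized writes, which are typically enabled per session or per table. A minimal sketch of those settings (the table name below is hypothetical, not from this post):

```
# Session-level: enable optimized writes for Delta writes in this session.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Table-level: the same behaviour as a Delta table property.
spark.sql("""
    ALTER TABLE example_db.example_table
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")
```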

4 REPLIES

Finleycartwrigh
New Contributor II

A few common causes to check:

- Data skewness: some tasks might be processing far more data than others.
- Incorrect resource allocation: ensure that Spark configurations (such as spark.executor.memory and spark.executor.cores) are set appropriately.
- Complex computations: the operations in the DAG might be too complex, causing excessive memory usage.
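To make the first two points concrete, here is a rough sketch with example values only (`df` stands in for the DataFrame being written; the memory and core numbers are illustrative, not recommendations):

```
# Executor sizing is fixed at cluster start, so these belong in the cluster's
# Spark config rather than being changed at runtime:
#   spark.executor.memory  32g
#   spark.executor.cores   8

# Quick skew check: row counts per partition just before the write.
from pyspark.sql.functions import spark_partition_id

(df.withColumn("pid", spark_partition_id())
   .groupBy("pid")
   .count()
   .orderBy("count", ascending=False)
   .show(10))
```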

Kaniz
Community Manager

Hi @lawrence009, spill during DeltaOptimizeWrite can occur for various reasons:


- Possible issue: running out of Java heap space


- Troubleshooting steps:
 • Clarify the issue and collect details (Notebook URL, Cluster URL, Consent to run commands, Time duration, Executor log)
 • Identify the problem through the Spark UI (look for java.lang.OutOfMemoryError: Java heap space)
 • Check driver logs for error messages (e.g., Spark Connector Worker: hit upload error)
 • Check executor logs for error messages (in the spark-executor/ip=<ip_address of the worker>/<executorId>/log4j file)
 • Analyze the stack trace to identify problematic steps in the code
 • If the stack trace shows com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException in the serialization trace, try a workaround by increasing spark.kryoserializer.buffer.max.mb
 • Implement the solution by increasing spark.kryoserializer.buffer.max.mb as needed (refer to the Spark Configuration documentation; see the sketch after this list)
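A minimal sketch of that last step, assuming the Kryo buffer really is the bottleneck. The 1024m value is only an example; recent Spark versions use the key spark.kryoserializer.buffer.max (the .mb form is the legacy name), and serializer settings must be supplied at session or cluster startup (on Databricks, via the cluster's Spark config):

```
from pyspark.sql import SparkSession

# Example only: raise the maximum Kryo serialization buffer at session startup.
spark = (SparkSession.builder
         .config("spark.kryoserializer.buffer.max", "1024m")
         .getOrCreate())
```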

Tharun-Kumar
Honored Contributor II

@lawrence009 

You can also take a look at the individual task-level metrics. These should help you understand whether skew was involved during processing. You can also get a better picture of the spill by viewing the Task Level Summary, which records aggregated metrics at the min, 25th, 50th, 75th, and max percentiles.
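For reference, the same task-level summary can be pulled from Spark's monitoring REST API, which returns metric distributions (including spill bytes) at the requested quantiles. This sketch assumes a directly reachable Spark UI; on Databricks the UI is proxied, so reading the Summary Metrics table on the stage page is usually easier. The host and stage ID below are placeholders:

```
import requests

base = "http://<driver-host>:4040/api/v1"   # placeholder host/port
app_id = requests.get(f"{base}/applications").json()[0]["id"]
stage_id = 0                                # the stage that spilled (from the DAG)

summary = requests.get(
    f"{base}/applications/{app_id}/stages/{stage_id}/0/taskSummary",
    params={"quantiles": "0.0,0.25,0.5,0.75,1.0"},
).json()

print(summary.get("memoryBytesSpilled"))    # per-quantile spill to memory
print(summary.get("diskBytesSpilled"))      # per-quantile spill to disk
```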

jose_gonzalez
Moderator

You can resolve the spill to memory by increasing the number of shuffle partitions, but 16 GB of spill should not have a major impact on your job execution. Could you share more details about the actual source code that you are running?
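A minimal sketch of that suggestion (the partition count is illustrative, not a tuned value); on recent runtimes, adaptive query execution can also coalesce or split shuffle partitions automatically:

```
# Raise the shuffle partition count so each task shuffles less data.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Let AQE tune partition sizes automatically where supported.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```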