
Troubleshooting Spill

lawrence009
Contributor

I am trying to troubleshoot why spill occurred during DeltaOptimizeWrite. I am running a 64-core cluster with 256 GB RAM, which I would expect to handle this amount of data (see attached DAG).

[Attached DAG screenshot: IMG_1085.jpeg]

4 REPLIES

Finleycartwrigh
New Contributor II

A few things to check:
- Data skewness: some tasks might be processing far more data than others.
- Incorrect resource allocation: ensure that Spark configurations (such as spark.executor.memory and spark.executor.cores) are set appropriately.
- Complex computations: the operations in the DAG might be too complex, causing excessive memory usage.
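One quick way to see whether skew is contributing is to count rows per partition before the write. A minimal sketch, assuming a PySpark DataFrame named df (the DataFrame name and the number of rows shown are illustrative, not from the original post):

```python
from pyspark.sql import functions as F

# Count rows in each partition of the DataFrame being written.
# A handful of partitions with far more rows than the rest suggests skew.
partition_counts = (
    df.withColumn("partition_id", F.spark_partition_id())
      .groupBy("partition_id")
      .count()
      .orderBy(F.desc("count"))
)

partition_counts.show(10)  # inspect the largest partitions
```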

Kaniz
Community Manager

Hi @lawrence009, spill during DeltaOptimizeWrite can occur for various reasons.


- Possible issue: encountering a Java Heap space issue


- Troubleshooting steps:
 • Clarify the issue and collect details (notebook URL, cluster URL, consent to run commands, time duration, executor log)
 • Identify the problem through the Spark UI (look for java.lang.OutOfMemoryError: Java heap space)
 • Check the driver logs for error messages (e.g., Spark Connector Worker: hit upload error)
 • Check the executor logs for error messages (in the spark-executor/ip=<ip_address of the worker>/<executorId>/log4j file)
 • Analyze the stack trace to identify the problematic steps in the code
 • Try a workaround if the stack trace shows com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException Serialization trace (increase spark.kryoserializer.buffer.max)
 • Implement the solution by increasing spark.kryoserializer.buffer.max as required (refer to the Spark Configuration documentation); a sketch of this setting follows the list
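If the Kryo buffer does turn out to be the culprit, the limit is normally raised before the session starts rather than at runtime. A minimal sketch, assuming a standalone PySpark session; the 512m value is an illustrative assumption, not a recommendation from this thread:

```python
from pyspark.sql import SparkSession

# Raise the Kryo serializer buffer limit at session creation time.
# On Databricks, the equivalent is to add this key/value under the
# cluster's "Spark config" (Advanced options) and restart the cluster.
spark = (
    SparkSession.builder
    .appName("kryo-buffer-example")
    .config("spark.kryoserializer.buffer.max", "512m")  # assumed value; tune per workload
    .getOrCreate()
)
```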

Tharun-Kumar
Honored Contributor II

@lawrence009 

You can also take a look at the individual task-level metrics. This should help you understand whether skew was involved during the processing. You can get a better picture of the spill by viewing the Task Level Summary, which records aggregated metrics at the min, 25th, 50th, and 75th percentiles and the max.
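Besides the Spark UI, the same percentile summary can be pulled programmatically from Spark's monitoring REST API. A minimal sketch, assuming direct access to the Spark UI endpoint (often http://localhost:4040 on open-source Spark; on Databricks the UI is proxied, so the base URL differs) and a hypothetical stage id:

```python
import requests

# Fetch the per-stage task summary with percentile distributions.
# Base URL, application id lookup, and stage/attempt ids are assumptions.
base = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base}/applications").json()[0]["id"]
stage_id, attempt = 3, 0  # hypothetical stage to inspect

summary = requests.get(
    f"{base}/applications/{app_id}/stages/{stage_id}/{attempt}/taskSummary",
    params={"quantiles": "0.0,0.25,0.5,0.75,1.0"},
).json()

# A large gap between the median and max hints at skew; the spill
# distributions show how spill is spread across tasks.
print(summary.get("memoryBytesSpilled"))
print(summary.get("diskBytesSpilled"))
```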

jose_gonzalez
Moderator

You can resolve the spill to memory by increasing the number of shuffle partitions, but 16 GB of spill should not have a major impact on your job execution. Could you share more details on the actual source code that you are running?
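For reference, a minimal sketch of raising the shuffle partition count; the value of 400 is an illustrative assumption, and with adaptive query execution enabled Spark can also adjust partition sizes at runtime:

```python
# Increase the number of shuffle partitions so each task handles less data
# and is less likely to spill. 400 is an assumed starting point, not a
# recommendation from this thread.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Alternatively, let adaptive query execution tune partition sizes at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```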
