Fix Hanging Task in Databricks

rgbuckley
New Contributor III

I am applying a pandas UDF to a grouped DataFrame in Databricks. When I do this, a couple of tasks hang indefinitely while the rest complete quickly.

I start by repartitioning my dataset so that each group falls entirely within a single partition:

group_factors = ['a','b','c'] #masked for anonymity
num_cores = spark.sparkContext.defaultParallelism #assumed definition (not in original post): total cores on this compute

model_df = (
    df
    .repartition(
        num_cores, #partition into max number of cores on this compute
        group_factors #partition by group so a group is always in the same partition
        )
    )
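
To confirm the repartition took effect, a quick sanity check (not in the original post) is to print the resulting partition count, which should equal num_cores:

print(model_df.rdd.getNumPartitions()) #should equal num_cores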

I then group my dataset and apply the udf:

results = (
    model_df #use repartitioned data
    .groupBy(group_factors) #build groups
    .applyInPandas(udf_tune, schema=result_schema) #apply in parallel
    )

#write results table to store parameters
results.write.mode('overwrite').saveAsTable(table_name)
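
For reference, the function passed to applyInPandas receives each group as a pandas DataFrame and must return a pandas DataFrame matching result_schema. The sketch below shows the expected shape only; the column names, the 'y' column, and the "tuning" logic are placeholders, not the original udf_tune:

import pandas as pd
from pyspark.sql import types as T

#assumed schema: one row of tuned parameters per group (placeholder columns)
result_schema = T.StructType([
    T.StructField('a', T.StringType()),
    T.StructField('b', T.StringType()),
    T.StructField('c', T.StringType()),
    T.StructField('best_param', T.DoubleType()),
])

def udf_tune(pdf):
    #each call receives all rows for one (a, b, c) group as a pandas DataFrame
    #placeholder "tuning": real code would fit/tune a model per group here
    best_param = float(pdf['y'].mean())
    return pd.DataFrame({
        'a': [pdf['a'].iloc[0]],
        'b': [pdf['b'].iloc[0]],
        'c': [pdf['c'].iloc[0]],
        'best_param': [best_param],
    })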

Spark then splits this into tasks equal to the number of partitions. It runs successfully for all but two tasks. Those two tasks do not throw errors, but instead hang until the timeout threshold on the job.

[Screenshot: Spark UI for the compute cluster]

What is strange is that these groups/tasks do not appear to have any irregularities. Their record counts are similar to those of the other 58 tasks that completed. The code does not throw any errors, so we don't have incorrectly typed or formatted data. Further, this command actually completes successfully about 20% of the time. But most days, we get caught on one or two hanging tasks that cause the job to fail.

The stderr simply notes that the task is hanging:

[Screenshot: stderr for the hanging task]

The stdout notes an allocation error (although all completed tasks contain the same allocation failure in their stdout files):

[Screenshot: stdout for the hanging task]

Any suggestions for how to avoid the hanging task issue?

P.S. When I reduce my data size (for example, splitting model_df into 4 smaller subsets, grouping and applying on each subset, and appending the results), I do not run into this issue.
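
For readers who want to replicate that splitting workaround, here is a minimal sketch under the same assumptions as above (udf_tune, result_schema, group_factors, table_name); the bucket column and the choice of 4 subsets are illustrative:

from pyspark.sql import functions as F

n_subsets = 4
bucketed = model_df.withColumn('bucket', F.abs(F.hash(*group_factors)) % n_subsets)

for i in range(n_subsets):
    subset = bucketed.filter(F.col('bucket') == i).drop('bucket')
    subset_results = (
        subset
        .groupBy(group_factors)
        .applyInPandas(udf_tune, schema=result_schema)
        )
    #overwrite on the first subset, append the rest
    subset_results.write.mode('overwrite' if i == 0 else 'append').saveAsTable(table_name)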

1 ACCEPTED SOLUTION


Anonymous
Not applicable

@Gary Buckley:

The hanging tasks issue you're experiencing with the pandas UDF in Databricks can be caused by various factors. Here are a few suggestions to help you troubleshoot and potentially resolve the problem:

  1. Increase the timeout: The hanging tasks might be related to longer processing times for specific groups. Try increasing the job's timeout threshold so these tasks have more time to complete before being treated as failed. You can set the timeout using the spark.databricks.sql.execution.pythonUDFTimeout configuration parameter.
  2. Check for resource constraints: The hangs could stem from resource limits on your compute cluster. Monitor CPU, memory, and disk utilization during the job. If the hanging tasks consistently land on specific nodes, that points to a resource constraint on those nodes; consider adjusting the cluster configuration to allocate more resources and relieve the bottleneck.
  3. Investigate data characteristics: Examine the groups whose tasks hang and look for patterns or anomalies that could hurt performance; for example, groups with significantly larger or more complex data can take much longer to process. Analyzing the data distribution and structure can help identify potential causes.
  4. Check for data skew: Data skew occurs when a few groups hold far more data than others, leading to imbalanced processing and some tasks running much longer than the rest. Consider Databricks' built-in skew-mitigation features, such as the skew join hint, to reduce the impact of skewed data. A quick per-group record count (sketched below) can confirm whether skew is actually present.
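
A minimal sketch of that per-group record count, assuming the model_df and group_factors from the question:

from pyspark.sql import functions as F

#count records per group; a handful of groups that dwarf the rest indicates skew
group_sizes = (
    model_df
    .groupBy(group_factors)
    .agg(F.count('*').alias('n_records'))
    )

group_sizes.orderBy(F.desc('n_records')).show(10)
group_sizes.select(F.min('n_records'), F.avg('n_records'), F.max('n_records')).show()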


5 REPLIES

Lakshay
Esteemed Contributor

Is the stdout log from one of the executors where the task was running or from the driver?

rgbuckley
New Contributor III

Both the stdout and stderr are from the executor running the task.

Anonymous
Not applicable

Hi @Gary Buckley

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

rgbuckley
New Contributor III

Thank you, Suteja. I had watched the resources and never reached capacity on any of them, and the data was evenly distributed across partitions and groups as well. I did end up taking your advice in (1): I set a timer, killed the process if a group took too long, and just used default values instead.

Thanks for the help.
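
For readers who want to reproduce that workaround, here is a minimal sketch of a per-group timeout inside the grouped-map function, under the same assumptions as the earlier sketches (the 'y' column, the 300-second limit, and the default value are illustrative, not the original code). You would pass udf_tune_with_timeout to applyInPandas in place of udf_tune:

import multiprocessing as mp
import pandas as pd

TIMEOUT_SECONDS = 300 #illustrative per-group limit
DEFAULT_PARAM = 0.0   #illustrative fallback value

def _tune(pdf, queue):
    #hypothetical tuning routine; real code would fit/tune a model here
    queue.put(float(pdf['y'].mean()))

def udf_tune_with_timeout(pdf):
    queue = mp.Queue()
    proc = mp.Process(target=_tune, args=(pdf, queue))
    proc.start()
    proc.join(TIMEOUT_SECONDS)

    if proc.is_alive():
        #group took too long: kill the tuning process and fall back to the default
        proc.terminate()
        proc.join()
        best_param = DEFAULT_PARAM
    else:
        best_param = queue.get() if not queue.empty() else DEFAULT_PARAM

    return pd.DataFrame({
        'a': [pdf['a'].iloc[0]],
        'b': [pdf['b'].iloc[0]],
        'c': [pdf['c'].iloc[0]],
        'best_param': [best_param],
    })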
