06-12-2023 11:44 AM
I am applying a pandas UDF to a grouped DataFrame in Databricks. When I do, a couple of tasks hang indefinitely while the rest complete quickly.
I start by repartitioning my dataset so that each group is in one partition:
group_factors = ['a', 'b', 'c']  # masked for anonymity
model_df = (
    df
    .repartition(
        num_cores,     # partition into the max number of cores on this compute
        group_factors  # partition by group so a group is always in the same partition
    )
)
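(num_cores isn't defined in the snippet above; a plausible definition, assuming the intent is one partition per available task slot on the cluster, would be:)

# Assumption: one partition per task slot available across the cluster
num_cores = spark.sparkContext.defaultParallelism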
I then group my dataset and apply the UDF:
results = (
    model_df                                        # use repartitioned data
    .groupBy(group_factors)                         # build groups
    .applyInPandas(udf_tune, schema=result_schema)  # apply in parallel
)

# write results table to store parameters
results.write.mode('overwrite').saveAsTable(table_name)
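For context, udf_tune takes each group's rows as a pandas DataFrame and returns one matching result_schema, per the applyInPandas contract; a minimal sketch of that shape (the schema and values below are illustrative placeholders, not the real tuning code):

import pandas as pd

# applyInPandas calls this once per group, passing the group's rows as a
# pandas DataFrame; the returned DataFrame must match result_schema.
result_schema = 'a string, b string, c string, best_param double'

def udf_tune(pdf: pd.DataFrame) -> pd.DataFrame:
    best_param = 0.0  # placeholder for whatever tuning happens per group
    return pd.DataFrame({
        'a': [pdf['a'].iloc[0]],
        'b': [pdf['b'].iloc[0]],
        'c': [pdf['c'].iloc[0]],
        'best_param': [best_param],
    })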
Spark then splits this into a number of tasks equal to the number of partitions. All but two tasks complete successfully. Those two tasks do not throw errors; they simply hang until the job's timeout threshold is reached.
What is strange is that these groups/tasks do not appear to have any irregularities. Their record sizes are similar to those of the other 58 completed tasks. The code does not throw any errors, so we don't have incorrectly typed or formatted data. Further, this command actually completes successfully about 20% of the time. But most days we get caught on one or two hanging tasks that cause the job to fail.
The stderr simply notes that the task is hanging. The stdout notes an allocation error, although all completed tasks contain the same allocation failure in their stdout files.
Any suggestions for how to avoid the hanging task issue?
P.S. When I reduce my data size (for example, splitting model_df into 4 smaller subsets, grouping and applying on each subset, and appending the results), I do not run into this issue.
06-13-2023 04:56 AM
Is the stdout log from one of the executors where the task was running or from the driver?
06-13-2023 08:41 AM
Both the stdout and stderr are from the executor running the task.
06-14-2023 12:42 AM
@Gary Buckley :
The hanging tasks issue you're experiencing with the pandas UDF in Databricks can be caused by various factors. Here are a few suggestions to help you troubleshoot and potentially resolve the problem:
1. Set a per-group timeout: kill the work for any group that runs too long and fall back to default parameter values.
2. Monitor cluster resources (CPU, memory) while the job runs to rule out capacity issues.
3. Check for data skew, i.e., whether records are evenly distributed across partitions and groups.
06-14-2023 11:44 PM
Hi @Gary Buckley
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
06-15-2023 08:49 AM
Thank you Suteja. I had watched the resources and never reached capacity on any of them, and the data was evenly distributed across partitions and groups as well. I did end up taking your advice in (1): I set a timer, killed the process if a group took too long, and used default values instead.
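In case it helps anyone else, here is a rough sketch of that timeout-and-fallback pattern (tune and make_default_result are placeholders for the real per-group logic, the 600-second limit is arbitrary, and signal.alarm is Unix-only):

import signal
import pandas as pd

TUNE_TIMEOUT_SECONDS = 600  # arbitrary placeholder threshold

class TuneTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise TuneTimeout()

def udf_tune_with_timeout(pdf: pd.DataFrame) -> pd.DataFrame:
    # SIGALRM interrupts the tuning call if this group runs too long (Unix-only)
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(TUNE_TIMEOUT_SECONDS)
    try:
        result = tune(pdf)  # placeholder: the actual per-group tuning logic
    except TuneTimeout:
        result = make_default_result(pdf)  # placeholder: build a row of default parameters
    finally:
        signal.alarm(0)  # always cancel the pending alarm
    return result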
Thanks for the help.