06-12-2023 11:44 AM
I am applying a pandas UDF to a grouped DataFrame in Databricks. When I do this, a couple of tasks hang forever, while the rest complete quickly.
I start by repartitioning my dataset so that each group is in one partition:
group_factors = ['a','b','c'] #masked for anonymity
model_df = (
    df
    .repartition(
        num_cores, #partition into max number of cores on this compute
        group_factors #partition by group so a group is always in same partition
    )
)
I then group my dataset and apply the udf:
results = (
    model_df #use repartitioned data
    .groupBy(group_factors) #build groups
    .applyInPandas(udf_tune, schema=result_schema) #apply in parallel
)
#write results table to store parameters
results.write.mode('overwrite').saveAsTable(table_name)
Spark then splits this into tasks equal to the number of partitions. It runs successfully for all but two tasks. Those two tasks do not throw errors, but instead hang until the timeout threshold on the job.
What is strange is that these groups/tasks do not appear to have any irregularities. The record size is similar to the other 58 completed tasks. The code does not throw any errors, so we don't have incorrectly typed or formatted data. Further, this command actually completes successfully about 20% of the time. But most days, we get caught on one or two hanging tasks that cause the job to fail.
The stderr simply notes that the task is hanging:
The stdout notes an allocation error (although all completed tasks contain the same allocation failure in their stdout files):
Any suggestions for how to avoid the hanging task issue?
P.S. When I reduce my data size (for example, splitting model_df into 4 smaller subsets, grouping and applying on each subset, and appending results) I do not run into this issue.
06-14-2023 12:42 AM
@Gary Buckley:
The hanging tasks issue you're experiencing with the pandas UDF in Databricks can be caused by various factors. Here are a few suggestions to help you troubleshoot and potentially resolve the problem:
06-13-2023 04:56 AM
Is the stdout log from one of the executors where the task was running or from the driver?
06-13-2023 08:41 AM
Both the stdout and stderr are from the executor running the task
06-14-2023 11:44 PM
Hi @Gary Buckley
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
06-15-2023 08:49 AM
Thank you Suteja. I had monitored the resources and never reached capacity on any of them. The data was evenly distributed across partitions and groups as well. I did end up taking your advice in (1): I set a timer, killed the process if a group took too long, and used default values instead.
Thanks for the help.
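For anyone landing here later, the timer-and-fallback workaround can be sketched roughly as below. This is not the exact code from my job, only a minimal illustration under some assumptions: `run_with_timeout` and `_child` are hypothetical helper names, the work runs in a forked child process (so it assumes a Unix-like worker where the "fork" start method is available), and the result must be picklable to come back over the pipe.

```python
import multiprocessing as mp


def _child(conn, func, args):
    # Runs in the child process: send back either the result or the error text.
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout_s=60.0, default=None):
    """Run func(*args) in a child process; return `default` if it exceeds timeout_s.

    Hypothetical helper for illustration. Assumes the "fork" start method
    (Linux/macOS) and a picklable return value.
    """
    ctx = mp.get_context("fork")
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(send_conn, func, args))
    proc.start()
    send_conn.close()  # parent keeps only the receiving end
    try:
        # poll() waits up to timeout_s for a message (or for the pipe to close).
        if not recv_conn.poll(timeout_s):
            return default  # timed out: fall back to the default value
        try:
            status, payload = recv_conn.recv()
        except EOFError:
            return default  # child died without sending anything
        if status == "error":
            raise RuntimeError(payload)  # surface real errors instead of hiding them
        return payload
    finally:
        if proc.is_alive():
            proc.terminate()  # kill the hung worker
        proc.join()
        recv_conn.close()
```

Inside the function passed to `applyInPandas`, the per-group tuning call would be wrapped with something like `run_with_timeout(tune_one_group, (pdf,), timeout_s=..., default=default_params)`, where `tune_one_group` and `default_params` are placeholders for the real tuning routine and fallback values. Real exceptions are re-raised on purpose so that only true hangs are replaced by defaults.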