
Python UDF fails with UNAVAILABLE: Channel shutdownNow invoked

Bharathi7
New Contributor II

I'm using a Python UDF to apply OCR to each row of a dataframe which contains the URL to a PDF document. This is how I define my UDF: 

 

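import json

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def extract_text(url: str):
    # MyOcr is my own OCR wrapper class (implementation not shown here)
    ocr = MyOcr(url)
    extracted_text = ocr.get_text()
    return json.dumps(extracted_text)

extract_text_udf = udf(lambda x: extract_text(x), StringType())

df2 = df2.withColumn('extracted_text', extract_text_udf(df2["url"]))
df2.display()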

 

For each row, the OCR can take about 1 minute to finish processing. But every time I invoke the UDF, it keeps running for a long time and eventually terminates with this error message: "SparkException: Job aborted due to stage failure: Task 2 in stage 253.0 failed 4 times, most recent failure: Lost task 2.3 in stage 253.0 (TID 375) (10.139.64.15 executor 0): com.databricks.spark.safespark.UDFException: UNAVAILABLE: Channel shutdownNow invoked"

Why does this happen with this UDF alone? There are no issues executing other Python UDFs. And moreover, when I execute the same code outside a UDF, using a for loop for example, it works perfectly. I am not sure why the UDF keeps failing. 
Could someone help me with this? 

My Databricks runtime version: 13.2

3 REPLIES

daniel_sahal
Honored Contributor III

@Bharathi7 It's really hard to determine what's going on without knowing what the MyOcr function actually does.

Maybe there's some kind of timeout on the service side? Too many parallel connections?

@daniel_sahal thanks for your reply! 

There is no timeout set on the service side actually. The same MyOcr function, when invoked outside a UDF has no issues processing the records. It only fails when the function is wrapped inside a UDF. So, I am guessing it has something to do with the Python UDF Timeout or some other issue. We did change the configuration parameter `spark.databricks.sql.execution.pythonUDFTimeout` to 200 seconds to allow the UDF to wait 200 seconds for each row before timing out. But that hasn't helped us either. 

Maybe spark.databricks.sql.execution.pythonUDFTimeout is meant to be interpreted differently? Our understanding was that it is a row-level timeout, i.e., the UDF waits up to 200 seconds before a record goes to a failed state. However, we could not find details on this anywhere in the Databricks documentation, only in one of the other topics in this community.

Again, we are not sure if this is because of the timeout either. It could be something entirely different too. 
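One way to test the timeout hypothesis is to record how long each call takes and to return failures as data instead of letting them fail the task. The sketch below is only an illustration: `timed_extract` and the `ocr_fn` parameter are hypothetical names, and `ocr_fn` stands in for whatever calls `MyOcr(url).get_text()`.

```python
import json
import time
from typing import Callable


def timed_extract(url: str, ocr_fn: Callable[[str], object]) -> str:
    """Run ocr_fn(url) and return a JSON string holding the result,
    the elapsed time in seconds, and any error message."""
    start = time.monotonic()
    try:
        text = ocr_fn(url)
        error = None
    except Exception as exc:
        # Report the failure as data so one bad row does not fail the task.
        text = None
        error = f"{type(exc).__name__}: {exc}"
    elapsed = round(time.monotonic() - start, 3)
    return json.dumps(
        {"url": url, "text": text, "elapsed_s": elapsed, "error": error}
    )
```

Inside the UDF this could be wired up as `udf(lambda u: timed_extract(u, lambda x: MyOcr(x).get_text()), StringType())`. Note that this only captures exceptions raised in the Python function itself; if the UDF worker process is shut down from outside, as the `Channel shutdownNow invoked` message may suggest, nothing will be returned for that row.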

daniel_sahal
Honored Contributor III

@Bharathi7 

My thought was that running the MyOcr function outside a UDF probably runs it sequentially, while inside a UDF it runs in parallel. This might cause the service to time out because of the large number of requests coming in at once.
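The sequential-versus-parallel idea can be checked outside Spark by calling the same function from several threads at once. The following is a minimal sketch under stated assumptions: `run_with_concurrency` is a hypothetical helper, and `FakeOcrService` is a made-up stand-in for the real OCR backend that only records how many calls overlap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_concurrency(fn, urls, max_workers):
    """Call fn(url) for every url using at most max_workers threads.
    Results are returned in the same order as urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, urls))


class FakeOcrService:
    """Hypothetical stand-in for the OCR backend; tracks peak concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def get_text(self, url):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        try:
            time.sleep(0.05)  # simulate OCR latency
            return f"text for {url}"
        finally:
            with self._lock:
                self._active -= 1
```

Replacing `FakeOcrService().get_text` with a function that calls `MyOcr(url).get_text()` and raising `max_workers` step by step would show whether the service starts failing under parallel load. If it does, reducing the number of partitions processed at once (for example with `df2.repartition(n)` for a small `n` before `withColumn`) is one possible way to cap the number of simultaneous UDF calls; whether that is enough depends on the cluster's core count.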
