Python UDF fails with UNAVAILABLE: Channel shutdownNow invoked
02-23-2024 12:25 AM - edited 02-23-2024 12:26 AM
I'm using a Python UDF to apply OCR to each row of a DataFrame, where each row contains the URL of a PDF document. This is how I define my UDF:
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def extract_text(url: str) -> str:
    ocr = MyOcr(url)
    extracted_text = ocr.get_text()
    return json.dumps(extracted_text)

extract_text_udf = udf(lambda x: extract_text(x), StringType())

df2 = df2.withColumn('extracted_text', extract_text_udf(df2["url"]))
df2.display()
For each row, the OCR takes about 1 minute to finish processing. But every time I invoke the UDF, it runs for a long time and eventually terminates with this error message: "SparkException: Job aborted due to stage failure: Task 2 in stage 253.0 failed 4 times, most recent failure: Lost task 2.3 in stage 253.0 (TID 375) (10.139.64.15 executor 0): com.databricks.spark.safespark.UDFException: UNAVAILABLE: Channel shutdownNow invoked"
Why does this happen with this UDF alone? There are no issues executing other Python UDFs. Moreover, when I execute the same code outside a UDF, for example in a plain for loop, it works perfectly. I am not sure why the UDF keeps failing.
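For context, the non-UDF version that works is roughly along these lines (a simplified sketch, not the exact code we run; it pulls the URLs to the driver and calls extract_text sequentially):

# Sketch of the non-UDF approach that works for us:
# collect the URLs to the driver and loop over them one at a time.
urls = [row["url"] for row in df2.select("url").collect()]
extracted = [extract_text(u) for u in urls]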
Could someone help me with this?
My Databricks runtime version: 13.2
02-23-2024 02:33 AM
@Bharathi7 It's really hard to determine what's going on without knowing what MyOcr actually does.
Maybe there's some kind of timeout on the service side? Too many parallel connections?
02-23-2024 02:54 AM
@daniel_sahal thanks for your reply!
There is no timeout set on the service side. The same MyOcr function, when invoked outside a UDF, has no issues processing the records; it only fails when the function is wrapped inside a UDF. So I am guessing it has something to do with the Python UDF timeout or some other issue. We did change the configuration parameter `spark.databricks.sql.execution.pythonUDFTimeout` to 200 seconds, to let the UDF wait 200 seconds for each row before timing out, but that hasn't helped us either.
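For reference, this is roughly how we changed it (a sketch; we set it as a Spark config on the cluster, and our understanding is that the value is in seconds):

# Sketch: raise the Python UDF timeout to 200 (seconds, as we understand it).
spark.conf.set("spark.databricks.sql.execution.pythonUDFTimeout", "200")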
Maybe there's a different interpretation of spark.databricks.sql.execution.pythonUDFTimeout? Our understanding was that it is a row-level timeout, i.e., wait 200 seconds before a record goes to a failed state. However, we could not find details on this anywhere in the Databricks documentation, except in one of the other topics in this community.
Again, we are not sure if this is because of the timeout either. It could be something entirely different too.
03-04-2024 12:40 AM
My thought was that running the MyOcr function outside a UDF runs it sequentially, while the UDF runs it in parallel across tasks; that might cause the service to time out due to the large number of requests coming in at once.
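One way you could test that theory is to force the UDF to run with almost no parallelism, for example something like this (just a sketch, reusing the extract_text_udf from your first post):

# Sketch: process the data in a single partition so the OCR calls run
# roughly one at a time, similar to the sequential for-loop case that works.
from pyspark.sql.functions import col

df2_serial = df2.coalesce(1).withColumn("extracted_text", extract_text_udf(col("url")))
df2_serial.display()

If the UDF succeeds on a single partition but fails with the default parallelism, that would point to the OCR service (or its connection limits) rather than the UDF mechanism itself.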