Databricks Community

Bharathi7 · ‎02-23-2024

I'm using a Python UDF to apply OCR to each row of a dataframe which contains the URL to a PDF document. This is how I define my UDF:

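import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def extract_text(url: str) -> str:
    # MyOcr is my own OCR wrapper class (not shown here)
    ocr = MyOcr(url)
    extracted_text = ocr.get_text()
    return json.dumps(extracted_text)

extract_text_udf = udf(extract_text, StringType())

df2 = df2.withColumn('extracted_text', extract_text_udf(df2["url"]))
df2.display()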

For each row, it can take about 1 minute for the OCR to finish processing. But every time I invoke the UDF, it keeps running for a long time and eventually terminates with this error message: "SparkException: Job aborted due to stage failure: Task 2 in stage 253.0 failed 4 times, most recent failure: Lost task 2.3 in stage 253.0 (TID 375) (10.139.64.15 executor 0): com.databricks.spark.safespark.UDFException: UNAVAILABLE: Channel shutdownNow invoked"

Why does this happen with this UDF alone? There are no issues executing other Python UDFs. And moreover, when I execute the same code outside a UDF, using a for loop for example, it works perfectly. I am not sure why the UDF keeps failing.
Could someone help me with this?

My Databricks runtime version: 13.2

daniel_sahal · ‎02-23-2024

@Bharathi7 It's really hard to determine what's going on without knowing what the MyOcr function actually does.

Maybe there's some kind of timeout on the service side? Too many parallel connections?
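
One way to check the parallel-connections idea without involving Spark is to call the same function from several threads at once and watch whether failures start to appear as concurrency grows. Below is a minimal sketch of that idea. `fake_ocr` is a hypothetical stand-in for `MyOcr(url).get_text()`, and `probe` and the example URLs are illustrative names that are not from the original post.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(url: str) -> str:
    """Hypothetical stand-in for MyOcr(url).get_text(); swap in the real call."""
    time.sleep(0.01)  # simulate a remote call
    return f"text for {url}"


def probe(func, urls, max_workers):
    """Call func on every URL with up to max_workers calls in flight.

    Returns (successes, failures); failures is a list of (url, repr(exception)).
    """
    successes, failures = [], []

    def run(url):
        try:
            return url, func(url), None
        except Exception as exc:  # record the failure instead of stopping the probe
            return url, None, repr(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, text, err in pool.map(run, urls):
            if err is None:
                successes.append((url, text))
            else:
                failures.append((url, err))
    return successes, failures


urls = [f"https://example.com/doc{i}.pdf" for i in range(8)]
ok, failed = probe(fake_ocr, urls, max_workers=4)
print(len(ok), "succeeded;", len(failed), "failed")
```

If the real function is fine with `max_workers=1` but starts failing at higher values, that would point toward the service (or a rate limit) rather than the UDF mechanism itself.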

Bharathi7 · ‎02-23-2024

@daniel_sahal thanks for your reply!

There is no timeout set on the service side actually. The same MyOcr function, when invoked outside a UDF has no issues processing the records. It only fails when the function is wrapped inside a UDF. So, I am guessing it has something to do with the Python UDF Timeout or some other issue. We did change the configuration parameter `spark.databricks.sql.execution.pythonUDFTimeout` to 200 seconds to allow the UDF to wait 200 seconds for each row before timing out. But that hasn't helped us either.

Maybe `spark.databricks.sql.execution.pythonUDFTimeout` is meant to be interpreted differently? Our understanding was that it is a row-level timeout, i.e., the UDF waits up to 200 seconds before a record is marked as failed. However, we could not find details on this anywhere in the Databricks documentation, only in one of the other topics in this community.

Again, we are not sure if this is because of the timeout either. It could be something entirely different too.
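
To narrow down whether per-row duration is the issue at all, one option is to record how long each call takes and capture any Python exception as data instead of letting it fail the task. The sketch below shows one way to do that. `timed_extract` is a hypothetical helper, and `extract` stands for whatever function performs the OCR call (for example, the body of `extract_text`).

```python
import json
import time


def timed_extract(url, extract):
    """Run extract(url); return a JSON string with the result, any error, and elapsed seconds."""
    start = time.monotonic()
    try:
        text, error = extract(url), None
    except Exception as exc:
        # Keep the error as data so one bad row does not abort the whole job
        text, error = None, f"{type(exc).__name__}: {exc}"
    elapsed = round(time.monotonic() - start, 3)
    return json.dumps({"text": text, "error": error, "seconds": elapsed})
```

Wrapped in a UDF the same way as `extract_text`, this would show per-row timings and Python-level errors in the output column. It would not catch a failure where the Python worker process itself goes away, so an unchanged `UNAVAILABLE` error with this wrapper in place would suggest the problem sits below the function.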

daniel_sahal · ‎03-04-2024

@Bharathi7

My thought was that the MyOcr function probably runs sequentially when called outside a UDF, whereas as a UDF it runs in parallel. That might cause the service to time out because of the large number of requests coming in at once.
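
If parallel load does turn out to be the cause, a common mitigation is to retry a failed call with exponential backoff so that a briefly overloaded service gets time to recover. Here is a minimal sketch under that assumption. `call_with_retry` and its parameters are hypothetical names, not part of Spark or of the original code.

```python
import time


def call_with_retry(func, arg, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(arg), retrying on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func(arg)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

Inside `extract_text`, the OCR call could then be written as `call_with_retry(lambda u: MyOcr(u).get_text(), url)`. Separately, since Spark runs at most one task per partition at a time, calling `df2.repartition(n)` with a small `n` before applying the UDF should cap the number of simultaneous OCR requests at `n`, at the cost of a longer overall run.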

Python UDF fails with UNAVAILABLE: Channel shutdownNow invoked
