I'm using a Python UDF to apply OCR to each row of a DataFrame, where each row contains the URL of a PDF document. This is how I define and apply my UDF:
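import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def extract_text(url: str) -> str:
    # Run OCR on the PDF at the given URL and return the result as a JSON string
    ocr = MyOcr(url)
    extracted_text = ocr.get_text()
    return json.dumps(extracted_text)

# Wrap the function as a Spark UDF that returns a string column
extract_text_udf = udf(extract_text, StringType())
df2 = df2.withColumn('extracted_text', extract_text_udf(df2["url"]))
df2.display()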
The OCR can take about 1 minute per row. But every time I invoke the UDF, it runs for a long time and eventually fails with this error message:
SparkException: Job aborted due to stage failure: Task 2 in stage 253.0 failed 4 times, most recent failure: Lost task 2.3 in stage 253.0 (TID 375) (10.139.64.15 executor 0): com.databricks.spark.safespark.UDFException: UNAVAILABLE: Channel shutdownNow invoked
Why does this happen only with this UDF? Other Python UDFs run without issues. Moreover, when I run the same code outside a UDF, for example in a for loop, it works perfectly. I am not sure why the UDF keeps failing.
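For comparison, the loop-based approach mentioned above might look roughly like the sketch below. It is only an illustration: `extract_all` and `fake_extract_text` are hypothetical names, and the stand-in function replaces `extract_text` so the snippet runs without `MyOcr` or a cluster.

```python
import json
from typing import Callable, Iterable, List, Tuple


def extract_all(
    urls: Iterable[str], extract_fn: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Call extract_fn on each URL one after another, outside any UDF."""
    results = []
    for url in urls:
        results.append((url, extract_fn(url)))
    return results


# Hypothetical stand-in for extract_text; it only echoes the URL as JSON.
def fake_extract_text(url: str) -> str:
    return json.dumps({"source": url})


pairs = extract_all(["a.pdf", "b.pdf"], fake_extract_text)

# In the notebook, the equivalent would presumably be something like:
#   urls = [row["url"] for row in df2.select("url").collect()]
#   pairs = extract_all(urls, extract_text)
#   result_df = spark.createDataFrame(pairs, ["url", "extracted_text"])
```

Here everything runs sequentially in the driver's Python process, so no call passes through the UDF execution path that raises the `UDFException`.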
Could someone help me with this?
My Databricks runtime version: 13.2