09-25-2023 02:08 PM
Trying to determine a root cause of UDFException that occurs when returning a variable length ArrayType. If I hardcode the data returned from the UDF to a fixed length, say 19, the error does not occur.
Setup code
split_runs_UDF = udf(split_runs_udf, ArrayType(StructType([StructField('split_run_id', StringType(), True), StructField('split_run_start_time', TimestampType(), True), StructField('split_run_end_time', TimestampType(), True)])))
The data gets subsequently `explode`d and then the `StructField`s are mapped to columns.
SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 4 times, most recent failure: Lost task 0.3 in stage 46.0 (TID 133) (10.0.17.11 executor 0): com.databricks.spark.safespark.UDFException: UNAVAILABLE: Channel shutdownNow invoked
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 46.0 failed 4 times, most recent failure: Lost task 0.3 in stage 46.0 (TID 133) (10.0.17.11 executor 0): com.databricks.spark.safespark.UDFException: UNAVAILABLE: Channel shutdownNow invoked
at com.databricks.spark.safespark.udf.UDFSession.handleException(UDFSession.scala:128)
at com.databricks.spark.safespark.udf.UDFSession$$anon$2.onError(UDFSession.scala:183)
at grpc_shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
at grpc_shaded.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
at grpc_shaded.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at grpc_shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
at grpc_shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
at grpc_shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at grpc_shaded.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750) Caused by: grpc_shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: Channel shutdownNow invoked
at grpc_shaded.io.grpc.Status.asRuntimeException(Status.java:535)
... 10 more Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3555)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3487)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3476)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3476)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1493)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1493)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3713)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3701)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1217)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1205)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2946)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:338)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:282)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$collect$1(Collector.scala:366)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:363)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:117)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:124)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:126)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:114)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:553)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:545)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:565)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:426)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:419)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:313)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:504)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:501)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3628)
at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3619)
at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4544)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:935)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4542)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:274)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:498)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:201)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1113)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:151)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:447)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4542)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3618)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:267)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:101)
at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:773)
at com.databricks.backend.daemon.driver.JupyterDriverLocal.computeListResultsItem(JupyterDriverLocal.scala:1077)
at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.addCustomDisplayData(JupyterDriverLocal.scala:254)
at sun.reflect.GeneratedMethodAccessor616.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
09-26-2023 03:46 AM
Hi @zak_k Could you give more context on the usecase? Are you using this udf in a DLT pipeline? which dbr version you are using in your cluster?
09-26-2023 06:31 AM
It's not currently part of DLT in this early phase, but that is the end goal. DBR version is 13.3.
The use case is splitting some records into 2 or more records (when start and end time cross a shift work boundary).
09-26-2023 07:43 AM
I'm being told it only occurs on shared mode clusters, I'm going to confirm this now.
09-26-2023 10:40 AM
I have Confirmed that in small batches the UDF works on both single user and shared clusters, but with large data sets, it only works for single user clusters
09-27-2023 06:03 AM
After further investigation, It reproduces slightly differently on single user mode.
Single user mode: runs forever
Shared: gives the above message
I've determined that there was a corner case in the dataset which lead to UDF never returning. I am am assuming that on shared the udf runner is killed by a watchdog and that is why the error message is so cryptic (rather than "Timeout" type error)
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group