Databricks Community

JesseLancaster · ‎11-04-2022

Hello,

I'm trying to use Databricks on Azure with a Spark structured streaming job and an having very mysterious issue.

I boiled the job down it it's basics for testing, reading from a Kafka topic and writing to console in a forEachBatch.

On local, everything works fine indefinately.

On Databricks, the task terminates after just over 5 minutes with a "Cancelled" status.

There are no errors in the log, just this, which appears to be a graceful shutdown request of some kind, but I don't know where it's coming from

22/11/04 18:31:30 INFO DriverCorral$: Cleaning the wrapper ReplId-1ea30-8e4c0-48422-a (currently in status Running(ReplId-1ea30-8e4c0-48422-a,ExecutionId(job-774316032912321-run-84401-action-5645198327600153),RunnableCommandId(9102993760433650959)))
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverLocal: cancelled jobGroup:2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153 
22/11/04 18:31:30 INFO ScalaDriverWrapper: Stopping streams for commandId pattern: CommandIdPattern(2207618020913201706,None,Some(job-774316032912321-run-84401-action-5645198327600153)).
22/11/04 18:31:30 INFO DatabricksStreamingQueryListener: Stopping the stream [id=d41eff2a-4de6-4f17-8d1c-659d1c1b8d98, runId=5bae9fb4-b5e1-45a0-af1e-a2f2553592c9]
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 5bae9fb4-b5e1-45a0-af1e-a2f2553592c9
22/11/04 18:31:30 INFO TaskSchedulerImpl: Cancelling stage 366
22/11/04 18:31:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 366: Stage cancelled
22/11/04 18:31:30 INFO MicroBatchExecution: QueryExecutionThread.interruptAndAwaitExecutionThreadTermination called with streaming query exit timeout=15000 ms

Any thoughts?

JesseLancaster · ‎11-09-2022

Kaniz,

Unfortunately that information is not useful.

1). I'm familiar with structured streaming and checkpoints, I've developed with spark for many years, just not on Databricks

2) This doesn't address the reason for the failure, a streaming job should run without interruption and not have to be restarted every 5 minutes

3) I tried setting up a retry policy, however it doesn't trigger (presumably because it's a cancellation according to the status not a failure) so even if I wanted to just restart the job every 5 minutes with a retry policy I cannot.