Hi Kaniz,
Thank you for your comprehensive response; I appreciate it. I have not resolved the issue yet, but I am perhaps a little closer.
Basically, my Job is a 3-step chain of Tasks:
- Step 1 is a "set up" Task that queries metadata and so on to define the next step of tasks.
- Step 2 is 1 to n Tasks that ingest groups of objects in parallel. Depends on step 1. The common Notebook in these Tasks uses dbutils.notebook.run() to call the Notebook containing the Auto Loader logic.
- Step 3 is basically the finishing Task to update logs and so on. Depends on step 2. This is designed to fail if a previous step 2 Task fails.
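To make the layout concrete, here is a rough sketch of how the three steps could be expressed as a task list. The field names (task_key, depends_on, notebook_task, max_retries) follow the Jobs API as I understand it, while the task keys, notebook paths, the "group" parameter and the helper name build_job_tasks are made up for illustration and are not my real configuration.

```python
def build_job_tasks(n_groups: int, max_retries: int = 2) -> list:
    """Build a setup -> n parallel ingestion tasks -> finalise chain."""
    # Step 1: single "set up" task (hypothetical key and path).
    setup = {
        "task_key": "setup",
        "notebook_task": {"notebook_path": "/Jobs/setup"},
        "max_retries": max_retries,
    }
    # Step 2: one task per group of objects, each depending on step 1.
    ingest = [
        {
            "task_key": f"ingest_group_{i}",
            "depends_on": [{"task_key": "setup"}],
            "notebook_task": {
                "notebook_path": "/Jobs/ingest_parent",  # hypothetical path
                "base_parameters": {"group": str(i)},    # hypothetical parameter
            },
            "max_retries": max_retries,
        }
        for i in range(1, n_groups + 1)
    ]
    # Step 3: finishing task that depends on every step 2 task.
    finalise = {
        "task_key": "finalise",
        "depends_on": [{"task_key": t["task_key"]} for t in ingest],
        "notebook_task": {"notebook_path": "/Jobs/finalise"},
        "max_retries": max_retries,
    }
    return [setup, *ingest, finalise]
```

The point of the sketch is only that retries are set per Task, and that step 3 depends on all of the step 2 Tasks.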
In my last test, I confirmed I had retries configured on every Task in the Job.
I was running ingestion for a new source, and expected schema changes between the older files and the current ones.
After it had run for some time, I saw that the 3rd finalising task was retrying, and that one of the step 2 tasks (where the auto loader code is) had failed - and had not retried.
So, I suspect the issue lies somewhere in the code around using notebook.run(), or within the Notebook called by it. Above you mention:
Notebook Execution and Retries:
- The notebook.run() approach should not inherently conflict with retries.
- However, consider the following:
  - If the notebook containing the Auto Loader code fails, the retry behaviour depends on the overall job configuration.
  - Verify that the retries are set at the job level and not overridden within the notebooks.
I wonder if that is what is happening here.
The implementation with notebook.run() follows, in pseudo code:
Step 2 "parent" task:
try:
    runResult = dbutils.notebook.run(
        # parameters here
    )
except:
    # code to log exception here
and then in the called notebook, there is this around .writeStream:
try:
    streamWriter = (df.writeStream ...
        .outputMode("append")
        .option("mergeSchema", "true")
        ...
    )
    streamWriter.awaitTermination()
except:
    # code to log exception here ...
dbutils.notebook.exit(json.dumps(logger.log_info))
I am not sure whether there is something significant in there, or whether I know enough about the nuances of calling and exiting notebook runs to see what could be causing the problem.
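One thing I may try next is sketched below. As far as I understand, a Task is only marked as failed (and therefore retried) when its notebook ends with an uncaught exception, whereas a caught exception or a dbutils.notebook.exit() call ends the run normally. If that is right, both of my except blocks could be hiding the failure from the Job. This is a minimal sketch under that assumption, not a confirmed fix. The helper names run_child_notebook and check_child_result, the logger name, and the "status" and "error" keys are hypothetical, and run_fn stands in for dbutils.notebook.run so the sketch can run outside Databricks.

```python
import json
import logging

logger = logging.getLogger("ingestion")  # hypothetical logger name


def check_child_result(run_result):
    """Parse the child's exit value and raise if it reports a failure.

    Assumes (hypothetically) that the child exits with JSON such as
    {"status": "failed", "error": "..."}.
    """
    info = json.loads(run_result)
    if info.get("status") == "failed":
        raise RuntimeError(f"Child notebook reported failure: {info.get('error')}")
    return info


def run_child_notebook(run_fn, path, timeout_seconds, arguments):
    """Run a child notebook, log any failure, then re-raise it.

    run_fn stands in for dbutils.notebook.run(path, timeout_seconds, arguments).
    """
    try:
        return check_child_result(run_fn(path, timeout_seconds, arguments))
    except Exception:
        logger.exception("Child notebook %s failed", path)
        raise  # re-raise so the logging step does not swallow the exception
```

In the parent Notebook, dbutils.notebook.run would be passed in as run_fn, so that a failure in the called Notebook still ends the step 2 Task with an exception after it has been logged.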
Appreciate it if anyone can provide any pearls of wisdom 😉