<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Structured Streaming Auto Loader UnknownFieldsException and Workflow Retries in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/58364#M31112</link>
    <description>&lt;P&gt;Another point I have realised, is that the task and the parent notebook (which then calls the child notebook that runs the auto loader part)&amp;nbsp;&lt;STRONG&gt;does not fail&lt;/STRONG&gt; if the schema-changed failure occurs during the auto loader process.&amp;nbsp; It's the child notebook and the job created by notebook.run() that fails.&amp;nbsp; Which suggests to me that this task needs to fail to trigger the automatic retry.&amp;nbsp; Except, the design here is that the parent notebook is an iterative orchestrator, it initiates the ingestion for a group of objects.&amp;nbsp; So if the first object fails, it needs to keep running to the next one, then the next.&amp;nbsp;&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":thinking_face:"&gt;🤔&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's an issue that makes a paradigm change to DLT have more merit, as automatic retries are just part of it and doesn't need thinking about or coding around.&lt;/P&gt;</description>
    <pubDate>Wed, 24 Jan 2024 20:28:24 GMT</pubDate>
    <dc:creator>ilarsen</dc:creator>
    <dc:date>2024-01-24T20:28:24Z</dc:date>
    <item>
      <title>Structured Streaming Auto Loader UnknownFieldsException and Workflow Retries</title>
      <link>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/53412#M29790</link>
      <description>&lt;P&gt;Hi.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using structured streaming and auto loader to read json files, and it is automated by Workflow.&amp;nbsp; I am having difficulties with the job failing as schema changes are detected, but not retrying.&amp;nbsp; Hopefully someone can point me in the right direction?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;From what I understand from the documentation (&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/schema#--how-does-auto-loader-schema-evolution-work" target="_blank"&gt;Configure schema inference and evolution in Auto Loader - Azure Databricks | Microsoft Learn&lt;/A&gt;) is that the failure is expected behaviour.&amp;nbsp; I have enabled retries on the Job Task, but it does not appear that the retries are happening.&amp;nbsp; What could be a key point, is that the job task runs a notebook, which in turn runs another notebook containing the auto loader code by&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;dbutils.notebook.run()&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Could the notebook.run() approach be conflicting with the retry setting on the Task, and therefore no retry?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My stream reader has these options defined:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;options = {
        "cloudFiles.useNotifications": "false",
        "cloudFiles.schemaLocation": &amp;lt;variable to checkpoint location here&amp;gt;,
        "cloudFiles.format": "json",
        "cloudFiles.includeExistingFiles": True,
        "cloudFiles.inferColumnTypes": True,
        "cloudFiles.useIncrementalListing": "true"
    }&lt;/LI-CODE&gt;&lt;P&gt;I notice that my options do not include&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;cloudFiles.schemaEvolutionMode&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;, but according to the docs that is default.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My stream writer has this option defined:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;.option("mergeSchema", "true")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does anyone have any ideas please?&lt;/P&gt;</description>
      <pubDate>Tue, 21 Nov 2023 23:05:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/53412#M29790</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2023-11-21T23:05:41Z</dc:date>
    </item>
    <item>
      <title>Re: Structured Streaming Auto Loader UnknownFieldsException and Workflow Retries</title>
      <link>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/58299#M31091</link>
      <description>&lt;P&gt;Hi Kaniz,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for your comprehensive response, I appreciate it.&amp;nbsp; I have not resolved the issue in my situation yet, but I am perhaps a little closer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Basically, my Job is a 3-step chain of Tasks::&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Step 1 is a "set up" Task that queries metadata and so on to define the next step of tasks.&lt;/LI&gt;&lt;LI&gt;Step 2 is 1 -&amp;nbsp;&lt;EM&gt;n&lt;/EM&gt; Tasks that run ingestion of groups of objects in parallel.&amp;nbsp; Depends on step 1.&amp;nbsp; The common Notebook in these Tasks uses dbutils.notebook.run() to call the Notebook containing the auto loader logic.&lt;/LI&gt;&lt;LI&gt;Step 3 is basically the finishing Task to update logs and so on.&amp;nbsp; Depends on step 2.&amp;nbsp; This is designed to fail if a previous step 2 Task fails.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;In my last test, I confirmed I had retries configured on every Task in the Job.&lt;/P&gt;&lt;P&gt;I was running ingestion for a new source, and expected schema changes between older files and now.&lt;/P&gt;&lt;P&gt;After it had run for some time, I saw that the 3rd finalising task was retrying, and that one of the step 2 tasks (where the auto loader code is) had failed - and had not retried.&lt;/P&gt;&lt;P&gt;So, I suspect the issue lies somewhere in the code around using notebook.run(), or within the Notebook called by it.&amp;nbsp; Above you mention :&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;STRONG&gt;Notebook Execution and Retries&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The notebook.run() approach should not inherently conflict with retries.&lt;/LI&gt;&lt;LI&gt;However, consider the following:&lt;UL&gt;&lt;LI&gt;If the notebook containing the Auto Loader code fails, the retry behaviour depends on the overall job configuration.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Verify that the retries are set at the job level and not overridden within the notebooks.&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I wonder if that is what is happening here.&lt;/P&gt;&lt;P&gt;The implementation with notebook.run() follows, in pseudo code:&lt;/P&gt;&lt;P&gt;Step 2 "parent" task:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;try:
runResult = dbutils.notebook.run( #parameters here )

except:
# code to log exception here&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;and then in the called notebook, there is this around .writeStream:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;try:
streamWriter = (df.writeStream ...
.outputMode("append")
.option("mergeSchema", "true")
...
)
streamWriter.awaitTermination()

except:
#code to log exception here ...
dbutils.notebook.exit(json.dumps(logger.log_info))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Not sure if there is something significant in there; or that I know enough about the nuances of calling and exiting notebook runs to see what could be causing a problem here.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Appreciate it if anyone can provide any pearls of wisdom&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2024 04:01:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/58299#M31091</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2024-01-24T04:01:31Z</dc:date>
    </item>
    <item>
      <title>Re: Structured Streaming Auto Loader UnknownFieldsException and Workflow Retries</title>
      <link>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/58364#M31112</link>
      <description>&lt;P&gt;Another point I have realised, is that the task and the parent notebook (which then calls the child notebook that runs the auto loader part)&amp;nbsp;&lt;STRONG&gt;does not fail&lt;/STRONG&gt; if the schema-changed failure occurs during the auto loader process.&amp;nbsp; It's the child notebook and the job created by notebook.run() that fails.&amp;nbsp; Which suggests to me that this task needs to fail to trigger the automatic retry.&amp;nbsp; Except, the design here is that the parent notebook is an iterative orchestrator, it initiates the ingestion for a group of objects.&amp;nbsp; So if the first object fails, it needs to keep running to the next one, then the next.&amp;nbsp;&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":thinking_face:"&gt;🤔&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's an issue that makes a paradigm change to DLT have more merit, as automatic retries are just part of it and doesn't need thinking about or coding around.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2024 20:28:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/structured-streaming-auto-loader-unknownfieldsexception-and/m-p/58364#M31112</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2024-01-24T20:28:24Z</dc:date>
    </item>
  </channel>
</rss>

