<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Schema evolution for JSON files with AutoLoader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125572#M47485</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using Auto Loader to ingest JSON files into a managed table. Auto Loader saves only the first-level fields as new columns, while nested structs are stored as values within those columns.&lt;/P&gt;&lt;P&gt;My goal is to support schema evolution when loading new files. However, Auto Loader only detects changes at the top-level columns. What are possible solutions to track and handle schema evolution for nested JSON structures?&lt;/P&gt;&lt;P&gt;Here's the code that I'm using:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = (
spark.readStream
.format("cloudFiles") 
.option("trigger","true")
.option("multiLine", "false")
.option("cloudFiles.format", "json") 
.option("cloudFiles.inferColumnTypes", "true") 
.option("recursiveFileLookup", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.option("readerCaseSensitive","false")
.option('cloudFiles.schemaLocation', checkpoint_path)
.load(source_path)
)
(
df.writeStream
.format("delta")
.option("mergeSchema", "true") 
.option("checkpointLocation", checkpoint_path)
.outputMode("append")
.trigger(availableNow=True) 
.table(target_table) 
)&lt;/LI-CODE&gt;</description>
    <pubDate>Thu, 17 Jul 2025 11:43:59 GMT</pubDate>
    <dc:creator>yit</dc:creator>
    <dc:date>2025-07-17T11:43:59Z</dc:date>
    <item>
      <title>Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125572#M47485</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using Auto Loader to ingest JSON files into a managed table. Auto Loader saves only the first-level fields as new columns, while nested structs are stored as values within those columns.&lt;/P&gt;&lt;P&gt;My goal is to support schema evolution when loading new files. However, Auto Loader only detects changes at the top-level columns. What are possible solutions to track and handle schema evolution for nested JSON structures?&lt;/P&gt;&lt;P&gt;Here's the code that I'm using:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = (
spark.readStream
.format("cloudFiles") 
.option("trigger","true")
.option("multiLine", "false")
.option("cloudFiles.format", "json") 
.option("cloudFiles.inferColumnTypes", "true") 
.option("recursiveFileLookup", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.option("readerCaseSensitive","false")
.option('cloudFiles.schemaLocation', checkpoint_path)
.load(source_path)
)
(
df.writeStream
.format("delta")
.option("mergeSchema", "true") 
.option("checkpointLocation", checkpoint_path)
.outputMode("append")
.trigger(availableNow=True) 
.table(target_table) 
)&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 17 Jul 2025 11:43:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125572#M47485</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-07-17T11:43:59Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125573#M47486</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;&amp;nbsp;perhaps it's worth investigating this as a potential route:&lt;BR /&gt;&lt;A href="https://community.databricks.com/t5/knowledge-sharing-hub/handling-complex-nested-json-in-databricks-using-schemahints/td-p/116210" target="_blank" rel="noopener"&gt;https://community.databricks.com/t5/knowledge-sharing-hub/handling-complex-nested-json-in-databricks-using-schemahints/td-p/116210&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's an interesting problem!&lt;BR /&gt;&lt;BR /&gt;This route does seem supported in the documentation as well:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/dlt/from-json-schema-evolution#override-schema-inference-using-schema-hints" target="_blank"&gt;https://docs.databricks.com/aws/en/dlt/from-json-schema-evolution&amp;nbsp;&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2025 11:52:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125573#M47486</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-07-17T11:52:51Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125588#M47492</link>
      <description>&lt;P&gt;Thanks for the suggestions&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp;.&lt;/P&gt;&lt;P&gt;My schema is dynamic so I can't use schema hints. The goal is when there is new nested field in the new coming JSON files, to merge it with the existing schema.&amp;nbsp;&lt;BR /&gt;Most of the time, the problem raises when there is {"metadata":{} } as nested field. Any thoughts?&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2025 13:17:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125588#M47492</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-07-17T13:17:04Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125602#M47495</link>
      <description>&lt;P&gt;Could you build out the schema dynamically by using something like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_0-1752761847085.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18267i5C03806F8C229541/image-size/medium?v=v2&amp;amp;px=400" role="button" title="BS_THE_ANALYST_0-1752761847085.png" alt="BS_THE_ANALYST_0-1752761847085.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/sql/language-manual/functions/schema_of_json" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/sql/language-manual/functions/schema_of_json&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Perhaps there's a way to store your JSON in a variable and inject it into the function. I'm hopeful there'll be other routes aswell.&lt;/P&gt;&lt;P&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2025 14:20:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125602#M47495</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-07-17T14:20:17Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125630#M47498</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Maybe give a try to a new VARIANT data type?&amp;nbsp;VARIANT is flexible to schema and type changes and maintains case sensitivity and NULL values present in the data source, this pattern is robust to most ingestion scenarios (at lease according to documentation &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/aws/en/ingestion/variant" target="_blank"&gt;Ingest data as semi-structured variant type | Databricks Documentation&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2025 16:01:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125630#M47498</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-07-17T16:01:26Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125725#M47518</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;&amp;nbsp;I've just been introduced to the VARIANT datatype on the course I'm doing.&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp; is bang on the mark here. This seems perfect!&lt;BR /&gt;&lt;A href="https://www.youtube.com/watch?v=fWdxF7nL3YI" target="_blank"&gt;https://www.youtube.com/watch?v=fWdxF7nL3YI&lt;/A&gt;&amp;nbsp;&amp;nbsp;&lt;BR /&gt;&lt;A href="https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark" target="_blank"&gt;https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Really impressed with the speed benchmarking on the 2nd link! Can you let us know if this resolves the problem. Very interested &lt;span class="lia-unicode-emoji" title=":ok_hand:"&gt;👌&lt;/span&gt;.&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jul 2025 13:57:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125725#M47518</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-07-18T13:57:17Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125983#M47598</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;thank you both for your replies.&lt;/P&gt;&lt;P&gt;The solution is already implemented in Auto Loader — it does exactly what I needed, even for nested fields.&lt;/P&gt;&lt;P&gt;The behavior that confused me was that when the evolution mode is set to addNewColumns and a file with new schema fields is processed, an error is initially raised for the new fields. At that same moment, however, the new merged schema is saved to the schema location. If we then reprocess the file (by running the stream again), it is processed correctly and schema evolution succeeds. The new fields are added and populated with null values for existing records.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 10:45:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125983#M47598</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-07-22T10:45:19Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125986#M47601</link>
      <description>&lt;P&gt;Thanks for sharing the solution&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;&amp;nbsp;!&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 11:34:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/125986#M47601</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-07-22T11:34:32Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution for JSON files with AutoLoader</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/126144#M47645</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;&amp;nbsp;awesome. Glad that you got this solved. I look forward to the next problem &lt;span class="lia-unicode-emoji" title=":smirking_face:"&gt;😏&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Wed, 23 Jul 2025 10:28:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-for-json-files-with-autoloader/m-p/126144#M47645</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-07-23T10:28:20Z</dc:date>
    </item>
  </channel>
</rss>

