<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Schema hints: define column type as struct and incrementally add fields with schema evolution in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132141#M49367</link>
<description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;,&lt;BR /&gt;You can’t. An “empty struct” is treated as a fixed struct with zero fields, so Auto Loader will not expand it later. The NOTE in the screenshot applies to JSON just as much as to Parquet/Avro/CSV.&lt;/P&gt;&lt;P&gt;If your goal is “discover whatever shows up under payload and keep adding new sub-fields,” simply don’t specify a hint for payload. Auto Loader will infer and evolve nested fields as they appear.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Example code (you can run it in any Databricks notebook):&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;base    = "/tmp/repro_empty_struct_json/input"
out    = "/tmp/repro_empty_struct_json/out_empty_struct"
chk    = "/tmp/repro_empty_struct_json/chk"
schema = "/tmp/repro_empty_struct_json/schema"


# cleanup
for p in [base, out, chk, schema]:
    _ = dbutils.fs.rm(p, True)

# two files: second file introduces a new nested subfield "bar"
dbutils.fs.mkdirs(base)
dbutils.fs.put(f"{base}/file1.json", """{"id":1,"payload":{"foo":"x"}}""", True)
dbutils.fs.put(f"{base}/file2.json", """{"id":2,"payload":{"foo":"y","bar":123}}""", True)

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")  # let the Delta writer evolve the table schema

### Run the below code###
dfB = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", schema)  
  .option("cloudFiles.inferColumnTypes", "true")    
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  # no schemaHints for payload
  .load(base))

qB = (dfB.writeStream
  .format("delta")
  .option("checkpointLocation", chk)
  .trigger(availableNow=True)
  .start(out))
qB.awaitTermination()

spark.read.format("delta").load(out).printSchema()
print("C) Data:")
display(spark.read.format("delta").load(out))

##### Add a new file with more subfields####
dbutils.fs.put(f"{base}/file3.json",
               """{"id":2,"payload":{"foo":"y","bar":123,"abc":{"foo1":"x"}}}""",
               True)


#### Re-run the stream above ####

# The first run after adding file3 fails (Auto Loader stops the stream on
# the schema change); once you retry, it evolves the schema automatically
# and produces the expected schema and result.&lt;/LI-CODE&gt;&lt;P&gt;Please do let me know if you have any further questions. Thanks!&lt;/P&gt;</description>
    <pubDate>Tue, 16 Sep 2025 17:06:38 GMT</pubDate>
    <dc:creator>K_Anudeep</dc:creator>
    <dc:date>2025-09-16T17:06:38Z</dc:date>
    <item>
      <title>Schema hints: define column type as struct and incrementally add fields with schema evolution</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132120#M49359</link>
      <description>&lt;P&gt;Hey everyone,&lt;/P&gt;&lt;P&gt;I want to set column type as empty struct via schema hints without specifying subfields. Then I expect the struct to be evolved with subfields through schema evolution when new subfields appear in the data.&amp;nbsp;&lt;/P&gt;&lt;P&gt;But, I've found in the documentation this explanation:&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="yit_2-1758029850608.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20011iEE9E8E0B4386AFC0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="yit_2-1758029850608.png" alt="yit_2-1758029850608.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Does this affect JSON files as well? Or, can I define empty struct and then evolve it with subfields?&amp;nbsp;&lt;/P&gt;&lt;P&gt;If yes, how? Because I've tried different approaches but nothing works.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 13:38:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132120#M49359</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-09-16T13:38:43Z</dc:date>
    </item>
    <item>
      <title>Re: Schema hints: define column type as struct and incrementally add fields with schema evolution</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132141#M49367</link>
<description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;,&lt;BR /&gt;You can’t. An “empty struct” is treated as a fixed struct with zero fields, so Auto Loader will not expand it later. The NOTE in the screenshot applies to JSON just as much as to Parquet/Avro/CSV.&lt;/P&gt;&lt;P&gt;If your goal is “discover whatever shows up under payload and keep adding new sub-fields,” simply don’t specify a hint for payload. Auto Loader will infer and evolve nested fields as they appear.&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Example code (you can run it in any Databricks notebook):&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;base    = "/tmp/repro_empty_struct_json/input"
out    = "/tmp/repro_empty_struct_json/out_empty_struct"
chk    = "/tmp/repro_empty_struct_json/chk"
schema = "/tmp/repro_empty_struct_json/schema"


# cleanup
for p in [base, out, chk, schema]:
    _ = dbutils.fs.rm(p, True)

# two files: second file introduces a new nested subfield "bar"
dbutils.fs.mkdirs(base)
dbutils.fs.put(f"{base}/file1.json", """{"id":1,"payload":{"foo":"x"}}""", True)
dbutils.fs.put(f"{base}/file2.json", """{"id":2,"payload":{"foo":"y","bar":123}}""", True)

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")  # let the Delta writer evolve the table schema

### Run the below code###
dfB = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", schema)  
  .option("cloudFiles.inferColumnTypes", "true")    
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  # no schemaHints for payload
  .load(base))

qB = (dfB.writeStream
  .format("delta")
  .option("checkpointLocation", chk)
  .trigger(availableNow=True)
  .start(out))
qB.awaitTermination()

spark.read.format("delta").load(out).printSchema()
print("C) Data:")
display(spark.read.format("delta").load(out))

##### Add a new file with more subfields####
dbutils.fs.put(f"{base}/file3.json",
               """{"id":2,"payload":{"foo":"y","bar":123,"abc":{"foo1":"x"}}}""",
               True)


#### Re-run the stream above ####

# The first run after adding file3 fails (Auto Loader stops the stream on
# the schema change); once you retry, it evolves the schema automatically
# and produces the expected schema and result.&lt;/LI-CODE&gt;&lt;P&gt;Please do let me know if you have any further questions. Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 17:06:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132141#M49367</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-16T17:06:38Z</dc:date>
    </item>
    <item>
      <title>Re: Schema hints: define column type as struct and incrementally add fields with schema evolution</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132194#M49378</link>
<description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/60098"&gt;@K_Anudeep&lt;/a&gt;&amp;nbsp;thank you for the reply!&lt;BR /&gt;This is how I've developed it, but I have some erroneous files where that exact column is an array instead of a struct, so it's inferred as string (the most generic type that covers both array and struct).&lt;BR /&gt;My goal was to define via schema hints that 'this column should be a struct, but its nested structure should be evolved'.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Sep 2025 07:03:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132194#M49378</guid>
      <dc:creator>yit</dc:creator>
      <dc:date>2025-09-17T07:03:40Z</dc:date>
    </item>
    <item>
      <title>Re: Schema hints: define column type as struct and incrementally add fields with schema evolution</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132206#M49381</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175553"&gt;@yit&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Yeah, that's right: in that case it will always evolve as a string, and that's expected behaviour in Auto Loader by design. Screenshot below:&lt;/P&gt;
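&lt;P&gt;If you still need payload as a struct despite the mixed files, one workaround (a minimal sketch; the sub-field names and sample rows below are assumptions for illustration, not from this thread) is to let the column land as a string and parse it afterwards with from_json and an explicit schema:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical sample: payload arrived as a JSON string because mixed
# struct/array inputs made Auto Loader fall back to string
df = spark.createDataFrame(
    [(1, '{"foo":"x","bar":123}'), (2, '["not","a","struct"]')],
    ["id", "payload"],
)

# Explicit schema for the sub-fields you care about (assumed names)
payload_schema = StructType([
    StructField("foo", StringType()),
    StructField("bar", LongType()),
])

# Rows whose payload is not a matching JSON object come back null
parsed = df.withColumn("payload_struct", F.from_json("payload", payload_schema))&lt;/LI-CODE&gt;&lt;P&gt;This keeps ingestion resilient to the erroneous array files while still giving downstream readers a typed struct; the trade-off is that payload_schema is maintained by hand rather than evolved automatically.&lt;/P&gt;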
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Sep 2025 07:50:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-hints-define-column-type-as-struct-and-incrementally-add/m-p/132206#M49381</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-09-17T07:50:04Z</dc:date>
    </item>
  </channel>
</rss>

