Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Schema hints: define column type as struct and incrementally add fields with schema evolution

yit
Contributor

Hey everyone,

I want to set a column's type to an empty struct via schema hints, without specifying subfields. Then I expect the struct to be evolved with subfields through schema evolution as new subfields appear in the data.

But I've found this explanation in the documentation:

[screenshot: documentation note on schema hints]

Does this apply to JSON files as well? Or can I define an empty struct and then evolve it with subfields?

If yes, how? I've tried different approaches but nothing works.

 


3 REPLIES

K_Anudeep
Databricks Employee

Hello @yit ,
You can’t. An “empty struct”  is treated as a fixed struct with zero fields, so AutoLoader will not expand it later. The NOTE in the screenshot applies to JSON just as much as Parquet/Avro/CSV.

If your goal is "discover whatever shows up under payload and keep adding new subfields," simply don't specify a hint for payload. Auto Loader will infer and evolve nested fields as they appear.
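As a toy illustration of that incremental inference (plain Python, not Auto Loader's actual implementation; `infer_type` and `evolve` are hypothetical helpers for the sketch):

```python
import json

def infer_type(value):
    """Map a JSON value to a simple type name; dicts recurse into field types."""
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    return type(value).__name__

def evolve(schema, record):
    """Merge newly seen fields into the running schema (additive only)."""
    for key, value in record.items():
        inferred = infer_type(value)
        if key not in schema:
            schema[key] = inferred
        elif isinstance(schema[key], dict) and isinstance(inferred, dict):
            evolve(schema[key], value)  # recurse into nested structs
    return schema

schema = {}
evolve(schema, json.loads('{"id": 1, "payload": {"foo": "x"}}'))
evolve(schema, json.loads('{"id": 2, "payload": {"foo": "y", "bar": 123}}'))
# payload now carries both subfields, just as Auto Loader would add "bar"
```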

 

Example code (you can run it in any Databricks notebook):

base   = "/tmp/repro_empty_struct_json/input"
out    = "/tmp/repro_empty_struct_json/out_empty_struct"
chk    = "/tmp/repro_empty_struct_json/chk"
schema = "/tmp/repro_empty_struct_json/schema"

# cleanup
for p in [base, out, chk, schema]:
    _ = dbutils.fs.rm(p, True)

# two files: the second file introduces a new nested subfield "bar"
dbutils.fs.mkdirs(base)
dbutils.fs.put(f"{base}/file1.json", """{"id":1,"payload":{"foo":"x"}}""", True)
dbutils.fs.put(f"{base}/file2.json", """{"id":2,"payload":{"foo":"y","bar":123}}""", True)

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")  # writer evolution

### Run the code below ###
dfB = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", schema)  
  .option("cloudFiles.inferColumnTypes", "true")    
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  # no schemaHints for payload
  .load(base))

qB = (dfB.writeStream
  .format("delta")
  .option("checkpointLocation", chk)
  .trigger(availableNow=True)
  .start(out))
qB.awaitTermination()

spark.read.format("delta").load(out).printSchema()
print("Data:")
display(spark.read.format("delta").load(out))

##### Add a new file with more subfields####
dbutils.fs.put(f"{base}/file3.json",
               """{"id":2,"payload":{"foo":"y","bar":123,"abc":{"foo1":"x"}}}""",
               True)


#### Re-run the code above ####

You will see that the job fails the first time; once you retry, it evolves the schema automatically and produces the expected schema and result. This is expected with addNewColumns: the stream stops when it detects new columns, and picks up the updated schema on restart.
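That fail-then-retry pattern can be wrapped in a small helper (a sketch; `fake_stream` is a hypothetical stand-in for starting the query above, so this runs without Spark):

```python
def run_with_retry(start_stream, max_retries=1):
    """Start the stream; retry once if it stops due to schema evolution.

    With cloudFiles.schemaEvolutionMode=addNewColumns, Auto Loader stops the
    stream when new columns appear; the schema location is already updated,
    so a restart proceeds with the evolved schema.
    """
    attempts = 0
    while True:
        try:
            return start_stream()
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise

# toy stand-in: fails on the first call (schema change), succeeds on retry
calls = {"n": 0}
def fake_stream():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("Encountered new columns during schema evolution")
    return "stream completed"

result = run_with_retry(fake_stream)
```

In a notebook you would pass a function that builds and awaits the query from the example above instead of `fake_stream`.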

Please do let me know if you have any further questions. Thanks!

@K_Anudeep thank you for the reply!
This is how I've developed it, but I have some erroneous files where that exact column is an array instead of a struct, so it's inferred as a string (the most generic type between array and struct).
My goal was to define via schema hints that 'this column should be a struct, but the nested structure should be evolved'.

K_Anudeep
Databricks Employee

Hello @yit ,

Yes, that's right: in that case the column will always evolve as a string, and that's expected behaviour in Auto Loader by design.

[screenshot: documentation note on type conflicts]
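That widening behaviour can be sketched with a toy merge function (plain Python, an illustration of the fallback described above, not Auto Loader's actual implementation):

```python
def merge_types(a, b):
    """Toy model of type widening: identical types stay, structs merge
    field by field, and any other conflict (e.g. struct vs array) falls
    back to the most generic type, string."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = merge_types(merged[key], value) if key in merged else value
        return merged
    return "string"  # conflicting shapes widen to string

# payload is a struct in one file and an array in another -> widens to string
conflict = merge_types({"foo": "str"}, "array<str>")
```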

 
