Schema Evolution - Auto Loader for Avro format is not working as expected

venkat09 · 01-31-2023

* Reading Avro files from S3 and writing them to a Delta table.

* Ingested a sample of 10 files containing four columns; the schema was inferred automatically, as expected.

* Introduced a new file containing a new column [foo] along with the existing columns; the stream failed with a new-field error, which is expected.

* Restarted the stream, which added the new column to the Delta table.

* Introduced a new file containing another new column [Foo], which differs from the previous new column only by case.

* Expected: the stream should not fail, and the new column's data should go into **_rescued_data**.

* Actual: the stream failed with the error message below.

* com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Found duplicate column(s) in the data to save: metadata
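To make the suspected collision concrete, here is a minimal plain-Python sketch. It does not use the Delta or Auto Loader API. It only assumes that the target table compares column names case-insensitively, which would make `foo` and `Foo` count as the same column. The helper name `find_case_collisions` and every column name other than `foo` and `Foo` are hypothetical.

```python
from collections import defaultdict


def find_case_collisions(columns):
    """Group column names that are equal when compared case-insensitively."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    # Keep only groups that hold more than one distinct spelling.
    return {key: names for key, names in groups.items() if len(set(names)) > 1}


# Hypothetical column list: four original columns, then foo, then Foo.
columns = ["id", "name", "ts", "value", "foo", "Foo"]
print(find_case_collisions(columns))  # {'foo': ['foo', 'Foo']}
```

A check like this on the incoming file's field names can flag case-only duplicates before the write is attempted.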

NOTE: I saw the option `readerCaseSensitive` in the documentation, but the explanation is unclear. I tried setting it to both false and true but hit the same issue.

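```
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "avro")
    # The inferred schema is tracked in the same location as the checkpoint.
    .option("cloudFiles.schemaLocation", bronzeCheckpoint)
    # Tried with both True and False; the error was the same either way.
    #.option("readerCaseSensitive", False)
    .load(rawDataSource)
    .writeStream
    .option("path", bronzeTable)
    .option("checkpointLocation", bronzeCheckpoint)
    # Allow new columns to be added to the Delta table.
    .option("mergeSchema", True)
    .table(bronzeTableName)
)
```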

My understanding from the documentation is that if a column name differs only by case from a column in the captured schema, that column should be moved to `_rescued_data`. Please let me know if that's not the case. Thanks
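That expectation can be written down as a small plain-Python model. It is only a sketch of my reading of the documentation, not how Auto Loader is implemented. The function `split_record`, its `reader_case_sensitive` parameter, and the sample record are hypothetical names used for illustration.

```python
import json


def split_record(record, schema_columns, reader_case_sensitive=True):
    """Model the expected handling of fields whose names differ only by case.

    Case-sensitive: a field that is not an exact match for a schema column
    is rescued. Case-insensitive: it is folded into the matching column.
    """
    by_lower = {c.lower(): c for c in schema_columns}
    row = {c: None for c in schema_columns}
    rescued = {}
    for key, value in record.items():
        if key in row:
            row[key] = value  # exact match with the captured schema
        elif not reader_case_sensitive and key.lower() in by_lower:
            row[by_lower[key.lower()]] = value  # case-insensitive match
        else:
            rescued[key] = value  # no match: goes to the rescued column
    row["_rescued_data"] = json.dumps(rescued) if rescued else None
    return row


schema = ["id", "foo"]
print(split_record({"id": 1, "Foo": "x"}, schema))
print(split_record({"id": 1, "Foo": "x"}, schema, reader_case_sensitive=False))
```

In this model the first call leaves `foo` empty and puts `Foo` in `_rescued_data`, and the second call reads `Foo` into `foo`. Neither call fails, which is the behaviour I expected from the stream.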
