Hi,
We have a parquet table (folder) in Azure Storage Account.
The table is partitioned by column PeriodId (represents a day in the format YYYYMMDD) and has data from 20181001 until 20211121 (yesterday).
We have a new development that adds a new column to this table from 20211101 onwards.
When we read the data for the interval [20211101, 20211121] in a Scala notebook, the resulting DataFrame does not include the new column.
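For reference, this is roughly how we read the interval (a sketch; the filter values are the ones above, and filePath is the same path used in the write code below):

```scala
import org.apache.spark.sql.functions.col

// How we currently read the interval, relying on default Parquet schema inference
val df = spark.read
  .format("parquet")
  .load(filePath)
  .filter(col("PeriodId") >= 20211101 && col("PeriodId") <= 20211121)
```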
What is the best way to solve this problem without having to rewrite all partitions with all columns?
Would having the table in Delta format instead of Parquet solve the problem?
Or is it just a matter of changing the way the table (folder) is saved?
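On the read side, we wondered whether Spark's Parquet mergeSchema option would be enough to surface the new column. Something like this (untested sketch):

```scala
// Untested: ask Spark to merge Parquet schemas across the partition files,
// instead of inferring the schema from a single footer
val dfMerged = spark.read
  .option("mergeSchema", "true")
  .parquet(filePath)
```

Is this the recommended approach, or does it have a performance cost at this table size?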
This is an excerpt of the code used to create the table (if it does not exist) or insert data into a partition.
val fileFormat = "parquet"
val filePartitionBy = "PeriodId"
val fileSaveMode = "overwrite"
val filePath = "abfss://<container>@<storage account>.dfs.core.windows.net/<folder>/<table name>"
val fileOptions = Map(
  "header" -> "true",
  "overwriteSchema" -> "true"
)

dfFinal
  .write
  .format(fileFormat)
  .partitionBy(filePartitionBy)
  .mode(fileSaveMode)
  .options(fileOptions)
  .save(filePath)
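If Delta is the way to go, is this roughly the change that would be needed? (Sketch only, not tested; our understanding is that the mergeSchema write option is what enables adding new columns in Delta without rewriting old partitions.)

```scala
// Untested sketch: same write, but as a Delta table with schema evolution enabled
dfFinal
  .write
  .format("delta")
  .partitionBy(filePartitionBy)
  .mode(fileSaveMode)
  .option("mergeSchema", "true")
  .save(filePath)
```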
Thanks in advance,
Tiago Rente.