
Partitioned parquet table (folder) with different structure

tarente
New Contributor III

Hi,

We have a parquet table (folder) in an Azure Storage Account.

The table is partitioned by column PeriodId (represents a day in the format YYYYMMDD) and has data from 20181001 until 20211121 (yesterday).

We have a new development that adds a new column to this table from 20211101 onwards.

When we read the data for the interval [20211101, 20211121] in a Scala notebook, the dataframe does not return the new column.

What is the best way to solve this problem without having to rewrite all partitions with all columns?

Would having the table in Delta format instead of parquet solve the problem?

Or is it just a matter of changing the way the table (folder) is saved?

This is an excerpt of the code used to create the table (if it does not exist) or insert data into a partition.

// Write settings for the partitioned table
val fileFormat      = "parquet"
val filePartitionBy = "PeriodId"
val fileSaveMode    = "overwrite"
val filePath        = "abfss://<container>@<storage account>.dfs.core.windows.net/<folder>/<table name>"
 
val fileOptions = Map (
                        "header" -> "true",
                        "overwriteSchema" -> "true"
                      )
 
// Create the table (if needed) or write data into a partition
dfFinal
  .write
  .format      (fileFormat)
  .partitionBy (filePartitionBy)
  .mode        (fileSaveMode)
  .options     (fileOptions)
  .save        (filePath)
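
For context, a minimal sketch of the read side that misses the new column (the filter bounds are the interval above; PeriodId is assumed here to be an integer column):

import org.apache.spark.sql.functions.col

val dfRead = spark
  .read
  .format(fileFormat)
  .load(filePath)
  .filter(col("PeriodId") >= 20211101 && col("PeriodId") <= 20211121)

dfRead.printSchema() // the new column does not appear here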

Thanks in advance,

Tiago Rente.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I think the problem is in the overwrite: a plain overwrite replaces all partition folders. The solution is to combine the write with dynamic partition overwrite, so it only overwrites the partitions that have data in the dataframe and doesn't affect the old partitions:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
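
A minimal sketch of how that combines with the write from the question (same variables; only the partitions present in dfFinal would be replaced):

// Only partitions present in dfFinal are replaced; older partitions stay intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dfFinal
  .write
  .format      ("parquet")
  .partitionBy ("PeriodId")
  .mode        ("overwrite")
  .save        (filePath)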

tarente
New Contributor III

Hi Hubert,

The overwrite is not overwriting all folders; it only adds the new column to the rewritten partitions.

The problem is that even if I filter only the rewritten partitions in the dataframe, I do not see the newly added column. However, if I open one of the parquet files of the rewritten partitions, I do see the new column.

If I open one of the parquet files of the original partitions, I do not see the new columns.

In other words, the parquet files have the new column in the new partitions but not in the original partitions, which is what I would expect.

What I would expect, and is not happening, is to get the new column when filtering only the rewritten partitions.

Thanks,

Tiago Rente.

Hi @Tiago Rente,

Have you tried schema merging? Docs here: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
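
For example, a minimal sketch (mergeSchema asks Spark to reconcile the schemas of all partition files at read time; old partitions would return null for the new column):

// Read with schema merging so columns added in newer partitions show up.
val df = spark
  .read
  .option("mergeSchema", "true")
  .parquet(filePath)

df.printSchema() // should now include the new column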

I think that having your table as Delta will solve this issue. You might want to test it.
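
If you test it, one option is CONVERT TO DELTA, which adds a Delta transaction log over the existing files without rewriting them. A sketch, assuming PeriodId is an INT partition column (the schema behaviour during conversion is worth verifying on a copy first):

// Convert the existing parquet folder to Delta in place; Delta reads return
// null for columns missing from older data files.
spark.sql(s"CONVERT TO DELTA parquet.`$filePath` PARTITIONED BY (PeriodId INT)")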
