cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Declarative Pipelines - Dynamic Overwrite

ismaelhenzel
Valued Contributor

Regarding the limitations of declarative pipelines—specifically the inability to use replaceWhere—I discovered through testing that materialized views actually support dynamic overwrites. This handles several scenarios where replaceWhere would typically be used. While MERGE is an option, there are cases where a primary key doesn't exist and you need to overwrite data based on a specific column, such as a date. Surprisingly, dynamic overwrite works perfectly in materialized views for this purpose.

where is my code example: 

from pyspark import pipelines as dp
from pyspark.sql import functions as F
from datetime import date, timedelta

@dp.table(
    name="TESTE",
    partition_cols=["Data"],
    spark_conf={
        "spark.sql.sources.partitionOverwriteMode": "dynamic"
    }
)
def teste():
    data_hoje = date.today()
    data_ontem = data_hoje - timedelta(days=1)
    
    dados_mock = [
        (data_hoje, "Linha A", 100.0, "D"),
        (data_hoje, "Linha B", 150.0, "N"),
        (data_hoje, "Linha A", 200.0, "D"),
        (data_hoje, "Linha B", 250.0, "N")
    ]
    
    # Criando o DataFrame
    df = spark.createDataFrame(dados_mock, ["Data", "Linha", "QuantidadeProduzida", "Turno"])
    
    # 2. Garantir que a coluna Data é do tipo Date
    return df.withColumn("Data", F.col("Data").cast("date"))

In the first run, the MV saved four records. Afterward, I modified dados_mock to include only four records for data_hoje. As a result, the MV ended up with the expected six records: it replaced two existing records and added two new ones. I also tested adding partitions that didn't exist yet, and they were inserted correctly. My question is: How unusual is it to use this approach in a Materialized View? I realize a full refresh would wipe this data, but the same applies to streaming tables in declarative pipelines. Has anyone else used this dynamic overwrite approach in declarative pipeline MVs?

 

1 ACCEPTED SOLUTION

Accepted Solutions

osingh
Contributor

This is a really interesting find, and honestly not something most people expect from materialized views.

Under the hood, MVs in Databricks declarative pipelines are still Delta tables. So when you set partitionOverwriteMode=dynamic and partition by a column like a date, Delta just does what it’s supposed to do. It overwrites only the partitions present in the write, keeps the rest, and inserts new ones if they don’t exist.

That’s why your results make sense:

-Rows for the same date get replaced
-New rows show up correctly
-Other dates are untouched

Most teams don’t use MVs this way because people usually think of them as “derived and recomputed,” not something that behaves like a controlled overwrite table. Since replaceWhere isn’t allowed, folks normally jump straight to MERGE or streaming tables.

But in cases like yours, where there’s no real primary key and a clear overwrite boundary like a date, this feels like a very reasonable solution. It’s simpler than MERGE and still gives you idempotent behavior.

The big thing to be aware of, like you already mentioned, is full refresh. That would blow everything away and rebuild it. But that’s not really unique to MVs; streaming tables in declarative pipelines have the same risk.

So yeah, it’s not a common pattern, but it’s not wrong either. I’ve seen similar approaches used in batch-style DLT pipelines when people really understand how Delta behaves. As long as it’s documented and intentional, this seems like a valid approach.

Om Singh

View solution in original post

1 REPLY 1

osingh
Contributor

This is a really interesting find, and honestly not something most people expect from materialized views.

Under the hood, MVs in Databricks declarative pipelines are still Delta tables. So when you set partitionOverwriteMode=dynamic and partition by a column like a date, Delta just does what it’s supposed to do. It overwrites only the partitions present in the write, keeps the rest, and inserts new ones if they don’t exist.

That’s why your results make sense:

-Rows for the same date get replaced
-New rows show up correctly
-Other dates are untouched

Most teams don’t use MVs this way because people usually think of them as “derived and recomputed,” not something that behaves like a controlled overwrite table. Since replaceWhere isn’t allowed, folks normally jump straight to MERGE or streaming tables.

But in cases like yours, where there’s no real primary key and a clear overwrite boundary like a date, this feels like a very reasonable solution. It’s simpler than MERGE and still gives you idempotent behavior.

The big thing to be aware of, like you already mentioned, is full refresh. That would blow everything away and rebuild it. But that’s not really unique to MVs; streaming tables in declarative pipelines have the same risk.

So yeah, it’s not a common pattern, but it’s not wrong either. I’ve seen similar approaches used in batch-style DLT pipelines when people really understand how Delta behaves. As long as it’s documented and intentional, this seems like a valid approach.

Om Singh