Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT Continuous Pipeline load

JothyGanesan
New Contributor III

Hi All,
In our project we have a DLT pipeline whose target DLT tables run in continuous mode.
These tables are common for multiple countries, and we go live in batches for different countries.

So, every time the business requests a change that requires updating a DLT table's metadata, we update the table definition in the notebook and are forced to run the pipeline with a full refresh.

This is becoming a big concern as:

  • Unless a table's data is still available in its corresponding source system, a full refresh cannot reload everything, so we risk losing history that was already in the table before the refresh
  • The SLA is impacted because reprocessing the complete 2 years of data takes a long time, sometimes close to 7 hours

Is there a way to keep working with these DLT pipelines but resume processing from the current point in time, even when metadata updates are required? Please help.

2 REPLIES

ManojkMohan
Honored Contributor II

@JothyGanesan 

Use dynamic schema handling and selective table updates to apply metadata changes incrementally from the current watermark, preserving history across country go-lives.


Replace static @dlt.table schema definitions with Auto Loader's schema inference and evolution, which handles column additions, drops, and type changes without code changes.

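As a sketch of that bronze pattern in SQL (the landing path, table name, and file format are illustrative assumptions, not values from the original post):

```sql
-- Bronze ingestion with Auto Loader: the schema is inferred and tracked,
-- and new columns are picked up automatically instead of forcing a code edit.
CREATE OR REFRESH STREAMING TABLE customers_bronze
AS SELECT * FROM STREAM read_files(
  '/Volumes/raw/landing/customers/',       -- hypothetical landing path
  format => 'json',
  schemaEvolutionMode => 'addNewColumns'   -- evolve when new columns appear
);
```

Because the checkpoint and inferred-schema state survive pipeline updates, subsequent runs continue from where the stream left off.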

Metadata Change Workflow (No Full Refresh)

  1. Update silver/gold logic (add/drop columns, new CTEs).
  2. Start pipeline update → DLT only refreshes affected tables downstream.
  3. Bronze Auto Loader resumes incrementally using the existing schema location and checkpoints.


Onboarding a new country? Update the isin() country filter and only the new country's data is processed, incrementally.
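In SQL, a go-live filter along those lines might look like this (table and column names are assumptions):

```sql
-- Silver table keeps only countries that are live; extending the IN list
-- triggers incremental processing of the new country's data only.
CREATE OR REFRESH STREAMING TABLE customers_silver
AS SELECT * FROM STREAM customers_bronze
WHERE country IN ('DE', 'FR');  -- append the next country code at its go-live
```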


Validate before Go Live:

[code screenshot]


This eliminates the 7-hour refreshes: new metadata typically propagates in roughly 10-30 minutes, depending on daily delta volume, and your continuous pipeline stays live during the change. Test on one country first?

SteveOstrowski
Databricks Employee
Hi @JothyGanesan,

This is a common scenario when running Lakeflow Spark Declarative Pipelines (SDP), previously known as DLT, in continuous mode across multi-country rollouts. There are several strategies to handle metadata changes on your streaming tables without resorting to a full refresh each time.

UNDERSTANDING WHAT REQUIRES A FULL REFRESH

Not all metadata changes require a full refresh. Here is a breakdown:

Changes that typically DO NOT require full refresh:
- Adding new columns to the end of a streaming table definition
- Changing table-level TBLPROPERTIES (comments, tags, etc.)
- Adding or modifying column comments
- Setting column-level tags or masks via ALTER STREAMING TABLE

Changes that typically DO require full refresh:
- Renaming existing columns
- Dropping columns
- Changing column data types
- Fundamentally restructuring the query logic

STRATEGY 1: USE DELTA COLUMN MAPPING

Enable Delta column mapping on your tables to allow metadata-only column renames and drops without rewriting data files. Set this table property in your pipeline definition:

TBLPROPERTIES (
'delta.columnMapping.mode' = 'name'
)

With column mapping enabled, operations like renaming or dropping columns become metadata-only changes, which means the underlying data files are not rewritten. Note that for streaming reads, you may need to configure a schema tracking location. When used inside SDP, this is managed automatically.
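Once column mapping is enabled, renames and drops become DDL-only operations; for example (the column names here are hypothetical):

```sql
-- Metadata-only under column mapping: no data files are rewritten
ALTER TABLE customers_silver RENAME COLUMN cust_nm TO customer_name;
ALTER TABLE customers_silver DROP COLUMN legacy_flag;
```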

STRATEGY 2: USE APPEND FLOWS TO ADD NEW SOURCES WITHOUT FULL REFRESH

If the metadata change involves adding new data sources (for example, new countries going live), use append flows rather than modifying the main query. This lets you add new streaming sources to an existing streaming table without triggering a full refresh.

Python example:

from pyspark import pipelines as dp

dp.create_streaming_table("customers_silver")

@dp.append_flow(target="customers_silver")
def country_a_flow():
    return spark.readStream.table("country_a_bronze")

@dp.append_flow(target="customers_silver")
def country_b_flow():
    return spark.readStream.table("country_b_bronze")

SQL example:

CREATE OR REFRESH STREAMING TABLE customers_silver;

CREATE FLOW country_a_flow
AS INSERT INTO customers_silver BY NAME
SELECT * FROM STREAM(country_a_bronze);

CREATE FLOW country_b_flow
AS INSERT INTO customers_silver BY NAME
SELECT * FROM STREAM(country_b_bronze);

Each new country can be added as a new append flow, and existing data remains untouched.

STRATEGY 3: PROTECT CRITICAL TABLES WITH pipelines.reset.allowed

To prevent accidental full refreshes on tables with expensive historical data, set:

TBLPROPERTIES (
'pipelines.reset.allowed' = 'false'
)

This blocks any full refresh on the table. If someone attempts a full refresh, the pipeline will skip that table. This is a safety net while you work on the metadata change approach.
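In a pipeline definition, this can be combined with the column-mapping property from Strategy 1; a sketch with illustrative table and source names:

```sql
CREATE OR REFRESH STREAMING TABLE customers_silver
TBLPROPERTIES (
  'pipelines.reset.allowed' = 'false',   -- block destructive full refreshes
  'delta.columnMapping.mode' = 'name'    -- metadata-only renames/drops
)
AS SELECT * FROM STREAM customers_bronze;
```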

STRATEGY 4: SEPARATE INGESTION FROM TRANSFORMATION

Consider structuring your pipeline in two layers:

1. A raw/bronze streaming table that ingests data as-is (minimal transformation, stable schema)
2. Downstream materialized views or streaming tables that apply the business metadata and transformations

When business metadata changes, you only need to update the downstream layer. Materialized views recompute incrementally by default and do not require a full refresh for query changes. The upstream streaming table with the raw data stays untouched.
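A minimal sketch of that two-layer split, assuming hypothetical table, path, and column names:

```sql
-- Layer 1: stable schema, ingests as-is, never full-refreshed
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT * FROM STREAM read_files('/Volumes/raw/orders/', format => 'json');

-- Layer 2: business metadata and transformations live here; editing this
-- query does not disturb the bronze stream or its checkpoints
CREATE OR REFRESH MATERIALIZED VIEW orders_enriched
AS SELECT order_id, upper(country) AS country_code, amount
FROM orders_bronze;
```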

STRATEGY 5: USE ALTER STREAMING TABLE FOR SUPPORTED CHANGES

For certain metadata changes, you can use ALTER STREAMING TABLE without modifying the pipeline definition at all:

ALTER STREAMING TABLE my_table
ALTER COLUMN my_column COMMENT 'Updated description';

ALTER STREAMING TABLE my_table
SET TAGS ('region' = 'EMEA', 'version' = '2.0');

These changes are applied immediately without requiring any refresh.

RECOMMENDED APPROACH FOR YOUR SCENARIO

Given that you have a multi-country continuous pipeline with 2+ years of historical data:

1. Set pipelines.reset.allowed = false on all critical streaming tables as an immediate safety measure
2. Enable delta.columnMapping.mode = 'name' on your tables to unlock metadata-only column operations
3. Use append flows for new country onboarding, so each country is a separate flow feeding the same target table
4. Move business metadata logic into downstream materialized views where changes do not require a full refresh of the streaming data
5. For any column additions, add them to the end of your schema definition, which is handled incrementally

DOCUMENTATION REFERENCES

- Pipeline update types and full refresh: https://docs.databricks.com/aws/en/delta-live-tables/updates
- Table properties including pipelines.reset.allowed: https://docs.databricks.com/aws/en/delta-live-tables/properties
- Append flows: https://docs.databricks.com/aws/en/delta-live-tables/flows
- Delta column mapping: https://docs.databricks.com/aws/en/delta/column-mapping.html
- Pipeline modes (continuous vs triggered): https://docs.databricks.com/aws/en/delta-live-tables/pipeline-mode

Note: The product previously known as "DLT" or "Delta Live Tables" is now called Lakeflow Spark Declarative Pipelines (SDP).

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.