Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta update/insert from multiple source tables

staskh
Contributor

[Sorry for a novice question.]

I have multiple tables that are periodically updated from external sources (including inserts, updates, and deletes). I need to keep a target table, built as an outer join of several source tables, up to date without rewriting it from scratch each time. I do not need this in real time; once a day is enough.

What are Databricks' best practices and techniques for this?

I would appreciate pointers to documentation, examples, and/or Academy lectures.

Regards,

Stas

 


1 ACCEPTED SOLUTION

Louis_Frolio
Databricks Employee

Greetings @staskh, I did some digging and compiled my thoughts regarding your question.

Building a Daily Gold Table from Delta Sources

Treat this as a standard Gold table built daily from Delta sources. Start simple. Add incremental tricks only when full recomputes stop scaling. Most teams overbuild this on day one.

Below are three patterns, ordered from simplest to most involved. Pick the lowest-numbered one that fits your data volume and SLA, then graduate only when you have a real reason to.

Pattern 1: Full daily rebuild (start here)

If your data volumes are reasonable, recompute the whole target table once a day. This is the simplest, most reliable approach, and it's often the right one to keep.

Do this:

  1. Land each external feed into its own Delta table (Bronze, then Silver if you need it).
  2. Once a day, recompute the target from scratch with your outer joins.
  3. Schedule the recompute as a Databricks job or a Lakeflow pipeline step.

Example:

CREATE OR REPLACE TABLE analytics.target_joined AS
SELECT  ...
FROM    source_a a
FULL OUTER JOIN source_b b   ON ...
FULL OUTER JOIN source_c c   ON ...
;

Why this works: everything is Delta, everything is ACID, and you avoid the complexity of incremental join logic. Don't move past this pattern unless rebuilds are clearly hurting cost or latency.
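
For concreteness, here is a hedged sketch of the rebuild, assuming each source shares a join key named id (the key and value columns here are hypothetical, not from the original question). With FULL OUTER JOIN, coalesce the key across sources so rows present in only one table still carry a non-null key:

-- Hypothetical schemas; adjust column names to your sources.
CREATE OR REPLACE TABLE analytics.target_joined AS
SELECT
  COALESCE(a.id, b.id, c.id) AS id,  -- keep the key even when a source lacks the row
  a.amount,
  b.status,
  c.last_seen_ts
FROM source_a a
FULL OUTER JOIN source_b b ON a.id = b.id
FULL OUTER JOIN source_c c ON COALESCE(a.id, b.id) = c.id;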

Pattern 2: Materialized view with incremental refresh (move here when daily rebuilds get expensive)

If the joined result is large and rebuilds get costly, define the target as a materialized view over your Delta sources. Let Databricks incrementally refresh it.

Do this:

  1. Define the join as a materialized view.

CREATE OR REPLACE MATERIALIZED VIEW analytics.target_joined_mv AS
SELECT  ...
FROM    source_a a
FULL OUTER JOIN source_b b   ON ...
FULL OUTER JOIN source_c c   ON ...
;

  2. Turn on row tracking, deletion vectors, and Change Data Feed on every source.

ALTER TABLE source_a SET TBLPROPERTIES (
  delta.enableDeletionVectors = true,
  delta.enableRowTracking     = true,
  delta.enableChangeDataFeed  = true
);

  3. Run the refresh on a serverless Lakeflow or SQL pipeline, scheduled daily.

Serverless pipelines can incrementally refresh many queries, including ones with LEFT, RIGHT, and FULL OUTER JOIN, but only when the documented conditions are met. Read the conditions before you commit (1).
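
If you drive the refresh from Databricks SQL rather than a pipeline, a minimal sketch is a daily scheduled query that triggers the refresh explicitly (using the hypothetical view name from above). Whether the refresh runs incrementally or falls back to a full recompute is decided by the engine, per the conditions in (1):

-- Schedule this statement to run once a day.
REFRESH MATERIALIZED VIEW analytics.target_joined_mv;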

Pattern 3: Fully incremental CDC-based pattern (only when the data is large and frequently changing)

Use this when your source systems expose inserts, updates, and deletes, and you need to propagate just the changes.

Do this:

  1. Enable Change Data Feed on the Delta sources, or use AUTO CDC for upstream CDC feeds.
  2. In Lakeflow Spark Declarative Pipelines (formerly DLT), define streaming tables and AUTO CDC flows for each source.
  3. Define a downstream materialized view or table that does the outer join over those streaming tables, once per run.

Don't reach for this pattern unless you actually need it. It scales to large, frequently changing tables and preserves history (SCD2) when you want it, but the engineering and operational cost is real.
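
As a sketch of what step 2 can look like in Lakeflow Declarative Pipelines SQL, here is a hedged example using the AUTO CDC API from (3). All names here (the orders_cdc source path, order_id, operation, sequence_num) are hypothetical; check the AUTO CDC reference for the exact options your feed needs:

-- Bronze: stream the raw CDC feed into a streaming table (hypothetical path and format).
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT * FROM STREAM read_files('/Volumes/main/raw/orders_cdc/', format => 'json');

-- Silver: declare the target, then let AUTO CDC apply inserts, updates, and deletes by key.
CREATE OR REFRESH STREAMING TABLE orders_silver;

CREATE FLOW orders_cdc_flow AS AUTO CDC INTO orders_silver
FROM STREAM(orders_bronze)
KEYS (order_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY sequence_num
STORED AS SCD TYPE 1;

The downstream outer join (step 3) is then an ordinary materialized view over orders_silver and its sibling Silver tables.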

Where to learn this on Databricks Academy

Take these two courses, in order:

  1. Data Engineering with Databricks. Modules on Delta Lake, medallion architecture, and incremental ETL.
  2. Advanced Data Engineering with Databricks. Modules on Structured Streaming, Lakeflow pipelines, CDF, and CDC-based pipelines, including outer-join Gold tables.

They walk through these exact patterns and complement the docs.

References

(1) Incremental refresh for materialized views: https://docs.databricks.com/aws/en/optimizations/incremental-refresh
(2) Change data capture and snapshots: https://docs.databricks.com/aws/en/ldp/what-is-change-data-capture
(3) The AUTO CDC APIs: https://docs.databricks.com/aws/en/ldp/cdc
(4) Delta Change Data Feed: https://docs.databricks.com/aws/en/delta/delta-change-data-feed
(5) CDC with Delta Live Tables: https://www.databricks.com/blog/2022/04/25/simplifying-change-data-capture-with-databricks-delta-liv...

Hope this helps.

Cheers, Louis.



staskh
Contributor

Thank you for such an informative and helpful response!