[Sorry for a novice question.]
I have multiple tables that are periodically updated from external sources (with inserts, updates, and deletes). I need to update a target table, which is an outer join of multiple source tables, without rewriting it from scratch each time. I do not need this in real time, only once a day.
What are Databricks' best practices, techniques, etc.?
I would appreciate pointers to documentation, examples, and/or Academy lectures.
Regards,
Stas
Greetings @staskh , I did some digging and compiled my thoughts regarding your question.
Treat this as a standard Gold table built daily from Delta sources. Start simple. Add incremental tricks only when full recomputes stop scaling. Most teams overbuild this on day one.
Below are three patterns, ordered from simplest to most involved. Pick the lowest-numbered one that fits your data volume and SLA, then graduate only when you have a real reason to.
If your data volumes are reasonable, recompute the whole target table once a day. This is the simplest, most reliable approach, and it's often the right one to keep.
Example:
CREATE OR REPLACE TABLE analytics.target_joined AS
SELECT ...
FROM source_a a
FULL OUTER JOIN source_b b ON ...
FULL OUTER JOIN source_c c ON ...
;
Why this works: everything is Delta, everything is ACID, and you avoid the complexity of incremental join logic. Don't move past this pattern unless rebuilds are clearly hurting cost or latency.
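One detail worth calling out for full outer joins: the join keys can be NULL on either side, so coalesce them when building the target. A filled-in sketch of the statement above, where the column names (id, amount, status) are illustrative:

```sql
CREATE OR REPLACE TABLE analytics.target_joined AS
SELECT
  COALESCE(a.id, b.id) AS id,  -- key survives even when one side has no match
  a.amount,
  b.status
FROM source_a a
FULL OUTER JOIN source_b b
  ON a.id = b.id;
```

Schedule this statement once a day (a Workflows job or a scheduled SQL query) and you are done.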
If the joined result is large and rebuilds get costly, define the target as a materialized view over your Delta sources. Let Databricks incrementally refresh it.
Do this:
CREATE OR REPLACE MATERIALIZED VIEW analytics.target_joined_mv AS
SELECT ...
FROM source_a a
FULL OUTER JOIN source_b b ON ...
FULL OUTER JOIN source_c c ON ...
;
ALTER TABLE source_a SET TBLPROPERTIES (
  'delta.enableDeletionVectors' = 'true',
  'delta.enableRowTracking' = 'true',
  'delta.enableChangeDataFeed' = 'true'
);
-- Repeat for each source table that feeds the materialized view.
Serverless pipelines can incrementally refresh many queries, including ones with LEFT, RIGHT, and FULL OUTER JOIN, but only when the documented conditions are met. Read the conditions before you commit (1).
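Since you only need the result once a day, you can attach a refresh schedule directly to the materialized view, or trigger the refresh from a daily job. A sketch (the cron expression and view name are illustrative; check the exact syntax in the docs):

```sql
-- Refresh daily at 06:00 UTC (Quartz cron syntax)
ALTER MATERIALIZED VIEW analytics.target_joined_mv
  ADD SCHEDULE CRON '0 0 6 * * ?';

-- Or trigger a refresh on demand, e.g. from a Workflows job
REFRESH MATERIALIZED VIEW analytics.target_joined_mv;
```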
Use this when your source systems expose inserts, updates, and deletes, and you need to propagate just the changes.
Do this: land each source's change feed, then apply it with the AUTO CDC / APPLY CHANGES APIs in a pipeline (3), reading changes from Delta Change Data Feed where available (4).
Don't reach for this pattern unless you actually need it. It scales to large, frequently changing tables and preserves history (SCD2) when you want it, but the engineering and operational cost is real.
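As a rough sketch of what the AUTO CDC / APPLY CHANGES approach looks like in a pipeline (the table, key, and column names here are illustrative, and the exact clause syntax is in (3)):

```sql
-- Target streaming table that the pipeline keeps in sync with the change feed
CREATE OR REFRESH STREAMING TABLE analytics.source_a_history;

APPLY CHANGES INTO analytics.source_a_history
FROM STREAM(source_a_changes)       -- change feed: one row per insert/update/delete
KEYS (id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY change_ts               -- ordering column for out-of-order events
STORED AS SCD TYPE 2;               -- keep history; use TYPE 1 for latest-only
```

Your daily outer-join target can then be built (or defined as a materialized view) over these always-current tables.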
Take these two courses, in order:
They walk through these exact patterns and complement the docs.
(1) Incremental refresh for materialized views: https://docs.databricks.com/aws/en/optimizations/incremental-refresh
(2) Change data capture and snapshots: https://docs.databricks.com/aws/en/ldp/what-is-change-data-capture
(3) The AUTO CDC APIs: https://docs.databricks.com/aws/en/ldp/cdc
(4) Delta Change Data Feed: https://docs.databricks.com/aws/en/delta/delta-change-data-feed
(5) CDC with Delta Live Tables: https://www.databricks.com/blog/2022/04/25/simplifying-change-data-capture-with-databricks-delta-liv...
Hope this helps.
Cheers, Louis.
Thank you for such an informative and helpful response!