Databricks Community

jfrohnhaus · Thursday

After reviewing a surprising number of Databricks discussions around SCD2, CDC, historical reporting and temporal joins, I noticed that most historical data modeling challenges seem to fall into a small set of recurring patterns:
Historical Backfill
Late Arriving Dimension
Early Arriving Fact
Snapshot Reproducibility
Historical Match Ambiguity
Historical State Consolidation
What's interesting is that the implementation details differ, but the underlying modeling problems often look very similar.
Am I missing any major historical modeling patterns?
Curious how others would categorize these problems.

amirabedhiafi · yesterday

Hello!

I would add a few more historical modeling patterns that often appear separately, even though they overlap with SCD2, CDC, or temporal joins.

One important case is bi-temporal modeling, where you need to separate business effective time from system or load time.

For example: what was the customer status on 1 March? versus what did we believe the customer status was on 1 March, based on what we knew on 5 March?

Another pattern is historical correction or restatement, where history itself changes because a source system corrects past records. This is different from a normal late-arriving dimension because the old historical truth may need to be restated.

CDC pipelines often focus on inserts and updates, but deletes are also a major historical modeling challenge. The question becomes: do we physically remove the record, soft-delete it, close the SCD2 row, or keep a tombstone event?

Another case is when a customer, product, account, or vehicle changes identifiers over time, or when multiple source identities later become one. This creates historical continuity problems: is this the same entity or a new one?

There is also hierarchy and relationship history, which means not only attributes changing, but relationships changing. For example: employee-to-manager, product-to-category, customer-to-segment, or sales-partner-to-region. Historical reporting often breaks when only the child entity is modeled as SCD2, but the relationship path is not.

Another pattern is grain evolution, where the level of detail changes over time. For example, old data exists monthly while new data exists daily, or old product data exists at brand level while new data exists at model level. This creates issues for reproducible trends.

One last situation is temporal conformance across sources, where multiple systems each have their own version of history and their own effective dates. The problem is not only “what was true then?”, but also “which system’s truth should win for that period?”

If this answer resolves your question, could you please mark it as “Accept as Solution”? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP

View solution in original post

amirabedhiafi · yesterday