cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Recurring Historical Data Modeling Patterns

jfrohnhaus
New Contributor

After reviewing a surprising number of Databricks discussions around SCD2, CDC, historical reporting and temporal joins, I noticed that most historical data modeling challenges seem to fall into a small set of recurring patterns:

  • Historical Backfill
  • Late Arriving Dimension
  • Early Arriving Fact
  • Snapshot Reproducibility
  • Historical Match Ambiguity
  • Historical State Consolidation

What's interesting is that the implementation details differ, but the underlying modeling problems often look very similar.

Am I missing any major historical modeling patterns?

Curious how others would categorize these problems.

1 ACCEPTED SOLUTION

Accepted Solutions

amirabedhiafi
Contributor

Hello!

I would add a few more historical modeling patterns that often appear separately, even though they overlap with SCD2, CDC, or temporal joins.

One important case is bi-temporal modeling, where you need to separate business effective time from system or load time.

For example: what was the customer status on 1 March? versus what did we believe the customer status was on 1 March, based on what we knew on 5 March?

Another pattern is historical correction or restatement, where history itself changes because a source system corrects past records. This is different from a normal late-arriving dimension because the old historical truth may need to be restated.

CDC pipelines often focus on inserts and updates, but deletes are also a major historical modeling challenge. The question becomes: do we physically remove the record, soft-delete it, close the SCD2 row, or keep a tombstone event?

Another case is when a customer, product, account, or vehicle changes identifiers over time, or when multiple source identities later become one. This creates historical continuity problems: is this the same entity or a new one?

There is also hierarchy and relationship history, which means not only attributes changing, but relationships changing. For example: employee-to-manager, product-to-category, customer-to-segment, or sales-partner-to-region. Historical reporting often breaks when only the child entity is modeled as SCD2, but the relationship path is not.

Another pattern is grain evolution, where the level of detail changes over time. For example, old data exists monthly while new data exists daily, or old product data exists at brand level while new data exists at model level. This creates issues for reproducible trends.

One last situation is temporal conformance across sources, where multiple systems each have their own version of history and their own effective dates. The problem is not only “what was true then?”, but also “which system’s truth should win for that period?”

If this answer resolves your question, could you please mark it as “Accept as Solution”? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP

View solution in original post

2 REPLIES 2

amirabedhiafi
Contributor

Hello!

I would add a few more historical modeling patterns that often appear separately, even though they overlap with SCD2, CDC, or temporal joins.

One important case is bi-temporal modeling, where you need to separate business effective time from system or load time.

For example: what was the customer status on 1 March? versus what did we believe the customer status was on 1 March, based on what we knew on 5 March?

Another pattern is historical correction or restatement, where history itself changes because a source system corrects past records. This is different from a normal late-arriving dimension because the old historical truth may need to be restated.

CDC pipelines often focus on inserts and updates, but deletes are also a major historical modeling challenge. The question becomes: do we physically remove the record, soft-delete it, close the SCD2 row, or keep a tombstone event?

Another case is when a customer, product, account, or vehicle changes identifiers over time, or when multiple source identities later become one. This creates historical continuity problems: is this the same entity or a new one?

There is also hierarchy and relationship history, which means not only attributes changing, but relationships changing. For example: employee-to-manager, product-to-category, customer-to-segment, or sales-partner-to-region. Historical reporting often breaks when only the child entity is modeled as SCD2, but the relationship path is not.

Another pattern is grain evolution, where the level of detail changes over time. For example, old data exists monthly while new data exists daily, or old product data exists at brand level while new data exists at model level. This creates issues for reproducible trends.

One last situation is temporal conformance across sources, where multiple systems each have their own version of history and their own effective dates. The problem is not only “what was true then?”, but also “which system’s truth should win for that period?”

If this answer resolves your question, could you please mark it as “Accept as Solution”? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP

Thanks, this is a very thoughtful addition to the list.

I particularly like the distinction between late-arriving data and historical corrections/restatements. I had been treating them as similar problems, but they really lead to different modeling decisions.

Relationship history, identity evolution and temporal conformance also feel like important categories that I had not explicitly separated out.

What’s interesting is that despite the different technologies and implementations, many historical data challenges seem to cluster around a relatively small set of recurring patterns.

I’ve actually been experimenting with a small prototype to classify and explain some of these patterns. This discussion already gave me a few ideas for additional categories.