Medallion Architecture: do I need a materialized “exploded” layer (raw JSON → exploded → CDC)?

guidotognini
New Contributor II

Hi everyone,

I’m building a Medallion-style pipeline on Databricks for nested JSON API responses and I’d like advice on the design of an intermediate “exploded” step:

  • Can I avoid materializing it as a table?

  • If not, how should I name/classify it in the Medallion model?


Current pattern (DLT + SQL)

  1. Landing (Volumes)

    • I extract API responses as JSON files and land them in a Databricks Volume.

  2. Bronze (raw JSON → table)

    • A DLT pipeline (SQL) reads those JSONs and creates a bronze streaming table.

    • The goal is to store them “as they are,” just normalizing JSON into columns.

    • One of the fields is an array of nested records, but in bronze I keep this field as a STRING (the raw JSON text of the array).

    • Schema evolution and _rescued_data are enabled.

  3. Exploded / normalized (intermediate)

    • In the next DLT SQL step, I parse that string as an ARRAY with an explicit schema and explode it.

    • I also flatten nested structs and cast types.

    • This is currently a streaming table, i.e., a materialized step.

  4. Silver with CDC

    • On top of the exploded table, I apply a CDC pattern using the ingestion timestamp of the JSON as the sequence column.

    • For each key (e.g. entity_id, date), I keep only the row with the most recent ingestion timestamp, so I get the latest and most up-to-date metric for that date.

    • This produces a “clean” silver table with one record per key/date.

So overall:

raw JSON (Volume) → bronze (raw-ish) → exploded/normalized → CDC-consolidated silver


What I’d like to understand

  1. Can I avoid a materialized exploded table?
    Concretely, is it possible / a better approach to:

    • Implement the explode + normalization as a materialized view instead of a table?

    • Use Python in DLT to keep this as a logical/view-like transformation rather than a persisted table?

    • Push the explode directly into the silver/CDC step, so there is no separate exploded object at all?

(I'm not sure which of these options are actually supported/best in practice, so I’m asking both “is it possible?” and “is it better?”)

  1. If I do keep it, how would you classify/name it in Medallion terms?
    The exploded table is:

    • No longer raw JSON (so not pure bronze),

    • Not yet the final silver state (it’s pre-CDC).

Would you typically treat this as:

  • a “bronze_normalized” / “bronze_prepared” table,

  • a generic “staging” layer between bronze and silver, or

  • a “silver_staging” table that conceptually belongs to silver?

I’m looking for patterns others use for:

Volumes with raw JSON → DLT bronze → explode nested array → CDC silver (latest per key/date)

…both in terms of:

  • whether to materialize the exploded step at all, and

  • how to name and position that step in a clean Medallion design.

Thanks in advance for any insights!

 
#LakeflowDeclarativePipelines #DLT #MedallionArchitecture