
Medallion Architecture: do I need a materialized “exploded” layer (raw JSON → exploded → CDC)?

guidotognini
New Contributor

Hi everyone,

I’m building a Medallion-style pipeline on Databricks for nested JSON API responses and I’d like advice on the design of an intermediate “exploded” step:

  • Can I avoid materializing it as a table?

  • If not, how should I name/classify it in the Medallion model?


Current pattern (DLT + SQL)

  1. Landing (Volumes)

    • I extract API responses as JSON files and land them in a Databricks Volume.

  2. Bronze (raw JSON → table)

    • A DLT pipeline (SQL) reads those JSONs and creates a bronze streaming table.

    • The goal is to store them “as they are,” just normalizing JSON into columns.

    • One of the fields is an array of nested records, but in bronze I keep this field as a STRING (the raw JSON text of the array).

    • Schema evolution and _rescued_data are enabled.

  3. Exploded / normalized (intermediate)

    • In the next DLT SQL step, I parse that string as an ARRAY with an explicit schema and explode it.

    • I also flatten nested structs and cast types.

    • This is currently a streaming table, i.e., a materialized step.

  4. Silver with CDC

    • On top of the exploded table, I apply a CDC pattern using the ingestion timestamp of the JSON as the sequence column.

    • For each key (e.g. entity_id, date), I keep only the row with the most recent ingestion timestamp, so I get the most up-to-date metric for that date.

    • This produces a “clean” silver table with one record per key/date.

So overall:

raw JSON (Volume) → bronze (raw-ish) → exploded/normalized → CDC-consolidated silver
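
To make this concrete, here's a trimmed-down sketch of the pipeline. Table names, the Volume path, and the array schema are illustrative placeholders, not my real ones, and I'm using the newer Lakeflow/DLT SQL syntax (older pipelines would use LIVE. prefixes and cloud_files()):

```sql
-- 1) Bronze: raw-ish streaming table; the nested array stays a STRING column
CREATE OR REFRESH STREAMING TABLE bronze_api_responses AS
SELECT
  *,
  _metadata.file_modification_time AS ingest_ts
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/landing/',   -- illustrative Volume path
  format => 'json'
);

-- 2) Exploded / normalized: parse the string array, explode, flatten, cast
CREATE OR REFRESH STREAMING TABLE exploded_api_responses AS
SELECT
  rec.entity_id,
  CAST(rec.metric_date AS DATE) AS metric_date,
  CAST(rec.value AS DOUBLE)     AS value,
  ingest_ts
FROM (
  SELECT
    explode(from_json(
      records,   -- the STRING column holding the raw JSON array
      'ARRAY<STRUCT<entity_id: STRING, metric_date: STRING, value: STRING>>'
    )) AS rec,
    ingest_ts
  FROM STREAM(bronze_api_responses)
);

-- 3) Silver: latest row per (entity_id, metric_date), sequenced by ingestion time
CREATE OR REFRESH STREAMING TABLE silver_metrics;

APPLY CHANGES INTO silver_metrics
FROM STREAM(exploded_api_responses)
KEYS (entity_id, metric_date)
SEQUENCE BY ingest_ts
STORED AS SCD TYPE 1;
```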


What I’d like to understand

  1. Can I avoid a materialized exploded table?
    Concretely, is it possible, and would it be a better approach, to:

    • Implement the explode + normalization as a materialized view instead of a table?

    • Use Python in DLT to keep this as a logical/view-like transformation rather than a persisted table?

    • Push the explode directly into the silver/CDC step, so there is no separate exploded object at all?

(I'm not sure which of these options are actually supported/best in practice, so I’m asking both “is it possible?” and “is it better?”)

  2. If I do keep it, how would you classify/name it in Medallion terms?
    The exploded table is:

    • No longer raw JSON (so not pure bronze),

    • Not yet the final silver state (it’s pre-CDC).

Would you typically treat this as:

  • a “bronze_normalized” / “bronze_prepared” table,

  • a generic “staging” layer between bronze and silver, or

  • a “silver_staging” table that conceptually belongs to silver?

I’m looking for patterns others use for:

Volumes with raw JSON → DLT bronze → explode nested array → CDC silver (latest per key/date)

…both in terms of:

  • whether to materialize the exploded step at all, and

  • how to name and position that step in a clean Medallion design.

Thanks in advance for any insights!

 
#LakeflowDeclarativePipelines #DLT #MedallionArchitecture
1 REPLY

ManojkMohan
Honored Contributor II

@guidotognini 

1. Can You Avoid Materializing Exploded Data?

Materialized Views: You can collapse the explode+normalize step into a single materialized view that transforms bronze directly into silver, so there is no separate exploded object to maintain. Note that a materialized view is still persisted and refreshed by the pipeline; what you avoid is the extra intermediate table, not storage as such.
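
A hedged sketch of that variant (illustrative names and schema, mirroring the question; the QUALIFY clause does the latest-per-key dedup that APPLY CHANGES would otherwise handle):

```sql
-- Sketch: one materialized view from bronze straight to silver,
-- folding in both the explode and the latest-per-key logic
CREATE OR REFRESH MATERIALIZED VIEW silver_metrics AS
SELECT
  rec.entity_id,
  CAST(rec.metric_date AS DATE) AS metric_date,
  CAST(rec.value AS DOUBLE)     AS value,
  ingest_ts
FROM (
  SELECT
    explode(from_json(
      records,
      'ARRAY<STRUCT<entity_id: STRING, metric_date: STRING, value: STRING>>'
    )) AS rec,
    ingest_ts
  FROM bronze_api_responses
)
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY rec.entity_id, rec.metric_date
  ORDER BY ingest_ts DESC
) = 1;
```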

DLT with Python or SQL can chain transformations without persisting each intermediate to storage, typically by defining the intermediate step as a (temporary) view. However, skipping materialization risks incomplete lineage, more difficult debugging, and slower downstream query performance.
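
For example, in SQL the exploded step can stay a temporary view (current pipeline syntax; legacy DLT spells this CREATE TEMPORARY LIVE VIEW), which is inlined into downstream consumers rather than persisted:

```sql
-- Sketch: exploded step as a temporary view; nothing is written to storage,
-- the definition is expanded into whatever reads from it
CREATE TEMPORARY VIEW exploded_api_responses AS
SELECT
  rec.entity_id,
  CAST(rec.metric_date AS DATE) AS metric_date,
  CAST(rec.value AS DOUBLE)     AS value,
  ingest_ts
FROM (
  SELECT
    explode(from_json(
      records,
      'ARRAY<STRUCT<entity_id: STRING, metric_date: STRING, value: STRING>>'
    )) AS rec,
    ingest_ts
  FROM STREAM(bronze_api_responses)
);
```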

One can also push explode+normalize directly into the silver table transformation, with the CDC logic applied in a single step. This reduces the number of tables, but may increase complexity and reduce auditability; if you ever need to reprocess or troubleshoot, having the exploded step materialized helps.
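
Combined with the temporary view above, that looks roughly like the following; as far as I know a streaming view can serve as the APPLY CHANGES source, but verify this against the docs for your runtime:

```sql
-- Sketch: CDC applied directly on top of the non-persisted exploded view,
-- so the pipeline goes bronze -> silver with no materialized intermediate
CREATE OR REFRESH STREAMING TABLE silver_metrics;

APPLY CHANGES INTO silver_metrics
FROM exploded_api_responses          -- the temporary view; already a streaming read
KEYS (entity_id, metric_date)
SEQUENCE BY ingest_ts
STORED AS SCD TYPE 1;
```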

Suggested actions:

  • Materialize the exploded/normalized step as a managed silver staging table if reliability, traceability, and modularity are priorities.
  • If your pipeline is simple, rarely needs isolated queries on the exploded array, and you don't require stepwise time-travel/debugging, consider using a view or merging normalization into the CDC silver step.
  • Use Medallion naming conventions that clearly indicate the purpose (e.g., "silver_staging" or "silver_normalized").
  • Document lineage to ensure that downstream consumers understand which objects are the audited system of record (bronze) and which are computed/corrected intermediates (silver).

https://www.linkedin.com/pulse/flattening-json-data-databricks-downstream-processing-avinash-narala-...

https://tecyfy.com/blog/medallion-architecture-best-practices-databricks-technical-deep-dive
