Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
JSON Medallion Best Practices

Kayla
Valued Contributor II

I'm looking at ingesting JSON files from an API, pulling a list of orders. Each JSON file has header information and then a nested array of items. I want to flatten this into a table with one row per item, with the header fields repeated for every item.

What is the best practice for doing this with a medallion architecture? Specifically, when does it make sense to explode the JSON? Normally we enforce a schema on Bronze to match the source, but we've mostly been pulling from highly structured databases. I can see arguments for having Bronze be a table with one column holding the raw JSON, plus other columns for any metadata we want to add; we'd then handle the nested schema on write to Silver.

Alternatively, we could immediately apply the exploded schema on write to Bronze.
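For illustration, the flatten step itself is the same whichever layer it runs in. The field names below (`order_id`, `customer`, `items`, `sku`, `qty`) are invented for the example; in Spark you'd typically do the equivalent with `pyspark.sql.functions.explode` on the items array, but the logic reduces to this:

```python
import json

def flatten_order(order_json: str) -> list[dict]:
    """Flatten one order document: repeat the header fields on every item row.

    Field names (order_id, customer, items, sku, qty) are illustrative only.
    """
    doc = json.loads(order_json)
    items = doc.pop("items", [])                 # nested array of line items
    return [{**doc, **item} for item in items]   # header columns + item columns

# One order with two items flattens to two rows, header repeated on each
raw = '{"order_id": 7, "customer": "acme", "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}'
rows = flatten_order(raw)
```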

1 REPLY

Coffee77
Contributor III

I'd need to know a little more about your scenario, but it reminds me of a similar case I faced. My approach was to use the Silver layer to create a Delta table with an enforced schema, standard field names and types, etc., where we performed the typical actions such as cleaning, adjusting, and running quality checks before moving aggregated data to the Gold layer.
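As a rough sketch of what such a pre-Gold quality check might look like; the required fields and rules here are invented for the example, not from the reply:

```python
# Illustrative required-field schema for Silver rows; not from the thread.
REQUIRED_FIELDS = {"order_id": int, "sku": str, "qty": int}

def passes_quality_checks(row: dict) -> bool:
    """Return True only if every required field is present with the right type
    and the quantity is positive."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(row.get(field), expected_type):
            return False
    return row["qty"] > 0

good = {"order_id": 7, "sku": "A", "qty": 2}
bad = {"order_id": 7, "sku": "A"}            # missing qty -> rejected
```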

So, what about the Bronze layer? We decided to only store the files in a volume in a catalog in the Bronze layer, along with a Delta table registering the related JSON metadata and each file's URI. In our use case we process file by file for many clients, each sending data with very different schemas, so why have a Bronze table with thousands of fields that are never processed jointly, or a table per client with the same structure as the files? It made no sense for us. If we need to recover Bronze data, we have the registered file and know where it is.
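The registration row for each landed file could be as simple as the following sketch; the column names and the volume path are illustrative, not Coffee77's actual schema:

```python
import hashlib
from datetime import datetime, timezone

def bronze_metadata_record(uri: str, payload: bytes) -> dict:
    """Build the row to register in a Bronze-layer metadata Delta table for one
    landed JSON file. Column names are illustrative."""
    return {
        "file_uri": uri,                                        # where the raw file sits in the volume
        "size_bytes": len(payload),
        "content_sha256": hashlib.sha256(payload).hexdigest(),  # dedupe / integrity check
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

rec = bronze_metadata_record("/Volumes/bronze/orders/order_7.json", b'{"order_id": 7}')
```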

Irrespective of whether you put the JSON in a field of a Delta table or keep it outside, the way we did, go to the Silver layer with a standardized schema and do the cleaning, adjusting, quality checks, etc. there.

I hope this helps.


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData