Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to decide on creating views vs Tables in Gold layer?

Dharinip
New Contributor III

We have the following use case:

We receive raw data from an application, and it is ingested into the Iron layer. The raw data is in JSON format.

The Bronze layer holds the first level of transformation: flattening the JSON files.

The Silver layer is the second level of transformation, where the final tables are formed. All of these tables are in SCD Type 2 format, so historical changes are retained.

---Here comes the final question.

We need to create a dataset that contains only the latest updated records from the final table. Should that be created in the Gold layer as a table, a view, or a materialized view?
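Whichever object you pick, the logic it encapsulates is the same: keep only the open row per key. Below is a minimal PySpark sketch, assuming a hypothetical silver table `silver.users_scd2` where a NULL `date_modified` marks the current version (all names are illustrative):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical silver SCD2 table: one row per version of each user.
scd2 = spark.read.table("silver.users_scd2")

# Option 1: the current version is the row whose end marker is still open.
latest = scd2.filter(F.col("date_modified").isNull())

# Option 2: rank versions per key and keep the newest; this also works
# if the table has no explicit "current row" marker.
w = Window.partitionBy("user_id").orderBy(F.col("date_created").desc())
latest_ranked = (
    scd2.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
)
```

A table, a view, and a materialized view would all wrap this same query; they differ only in when and where it is computed, and in how fresh the result is.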


4 REPLIES

madams
Contributor

Hi Dharinip. I've had similar conversations internally with developers asking very similar things. This is my general advice for this situation, but there are a lot of considerations that go into how you create your gold layer.

  1. How large is this dataset, and how much growth is expected? The larger the dataset gets, the slower view queries could become. If it's very large, I'd recommend materializing it as a table.
  2. Is the silver table a MANAGED table in Delta format, or is it in another format? If it's not Delta, this could potentially slow down the performance of a view.
  3. Are you able to add liquid clustering on your soft-delete flag? This would allow Spark to more easily fetch just the live records if you use a view instead of a table (a sketch follows this list).
  4. Are you able to further enhance the gold table with data from additional data sources? If so, you would likely want to materialize it as a table, so that you don't have a large number of joins in a gold-level view. Joins are a very expensive operation in terms of processing time.
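On point 3, liquid clustering can be added to an existing Delta table with a one-line DDL statement. A minimal sketch, assuming a hypothetical `silver.users_scd2` table with an `is_current` soft-delete flag (both names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Switch the table to liquid clustering, keyed on the soft-delete flag,
# so queries that filter for live records can skip most files. Newly
# written data is clustered; run OPTIMIZE to recluster existing files.
spark.sql("ALTER TABLE silver.users_scd2 CLUSTER BY (is_current)")
```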

Ultimately, I'd try it both ways and gauge performance, because there are potentially other factors that I didn't even list. The upside of a view is that the data is truly live once it lands in silver: there's no further `MERGE` operation that needs to happen to the files on disk.
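If the view route wins out, here is a sketch of what it could look like, reusing the hypothetical names from the sketch above (NULL `date_modified` marks the live row):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A plain view re-reads silver on every query, so the gold data is
# exactly as fresh as silver -- no extra MERGE or refresh step.
spark.sql("""
    CREATE OR REPLACE VIEW gold.users_current AS
    SELECT user_id, user_name, date_created
    FROM silver.users_scd2
    WHERE date_modified IS NULL
""")
```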

If you have answers to the above questions, I can try and formulate a more specific recommendation for your scenario!

Dharinip
New Contributor III

Hi Madams,

Thank you for your reply. 

Let me explain our use case in detail.

The raw data is ingested into the Iron layer via batch processing; the files we receive in batches are in JSON format.

IRON Layer - The first level, where the data is ingested. Each batch file is treated as one record in the Delta Live Table.

BRONZE Layer - The first level of transformation happens here: flattening the JSON files into objects.

SILVER Layer - The actual Delta tables are created here, but all of the tables are SCD Type 2. Historical changes are recorded with their data timestamps, and the last updated record is the active record. For example, consider a user details table:

user_id | user_name | date_created | date_modified
123     | ABC       | 11/01/2024   | 11/02/2024
123     | ABC       | 11/02/2024   | 11/05/2024
123     | ABC       | 11/05/2024   | NULL
456     | XYZ       | 11/01/2024   | 11/05/2024
456     | XYZ       | 11/05/2024   | NULL

The user table above is created in the Silver layer as a streaming DLT table. The changes made are recorded since it is SCD Type 2.
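For reference, a silver SCD Type 2 table like this is typically maintained with the DLT APPLY CHANGES API. A hedged sketch, with the source and target names invented for illustration:

```python
import dlt

# Streaming target table that DLT maintains as SCD Type 2.
dlt.create_streaming_table("silver_users")

# APPLY CHANGES keeps history: each change to a user_id closes the
# previous row and opens a new one, matching the table shown above.
dlt.apply_changes(
    target="silver_users",
    source="bronze_users",        # flattened bronze feed (illustrative name)
    keys=["user_id"],
    sequence_by="date_created",   # column that orders the change events
    stored_as_scd_type=2,
)
```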

---GOLD Layer

I need help in deciding on the Gold layer. My requirements are:

To create a dataset like the following in the Gold layer:

user_id | user_name | date_created | date_modified
123     | ABC       | 11/05/2024   | NULL
456     | XYZ       | 11/05/2024   | NULL

- The final user dataset in the Gold layer should have only the last updated record for each user. The user_id should be the primary key.

- Sometimes I would have to join with other tables to bring in some columns.

- The column names need to be changed.

- Default values will be applied to NULL columns.

These are pretty much the transformations required for the dataset to be in the Gold layer (sketched below).
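Those requirements are all simple DataFrame operations. A sketch, carrying over the column names from the example above, with the joined table (`silver.departments`) and its columns invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

users = spark.read.table("silver.users_scd2")
depts = spark.read.table("silver.departments")  # hypothetical extra source

gold = (
    users
    .filter(F.col("date_modified").isNull())   # only the latest record per user
    .join(depts, on="user_id", how="left")     # occasional enrichment join
    .select(
        F.col("user_id").alias("USER_ID"),     # renamed columns
        F.col("user_name").alias("USER_NAME"),
        F.col("date_created").alias("USER_START_DATE"),
        # default value applied to NULL columns
        F.coalesce(F.col("dept_name"), F.lit("UNKNOWN")).alias("USER_DEPT"),
    )
)
```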

My questions are:

1. Should the dataset in the GOLD layer be created as streaming tables or materialized views? The data will keep growing, and these tables are relatively large datasets.

2. Some QC checks may be applied in the future, if not now.

3. We need to define primary key constraints on the datasets (see the sketch below).
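On questions 2 and 3: in a DLT pipeline, QC checks map naturally onto expectations, and Unity Catalog primary keys are informational constraints. A hedged sketch, reusing the names from the sketches above (a batch read inside `@dlt.table` is what DLT maintains as a materialized view):

```python
import dlt
from pyspark.sql import functions as F

# Batch read => DLT maintains this as a materialized view.
@dlt.table(name="gold_users", comment="Current version of each user")
@dlt.expect_or_drop("user_id_set", "user_id IS NOT NULL")  # QC: drop bad rows
@dlt.expect("row_is_current", "date_modified IS NULL")     # QC: warn only
def gold_users():
    return dlt.read("silver_users").filter(F.col("date_modified").isNull())

# For a regular (non-pipeline) Delta table in Unity Catalog, the
# informational primary key would be declared roughly like this:
#   ALTER TABLE gold_users ALTER COLUMN user_id SET NOT NULL;
#   ALTER TABLE gold_users ADD CONSTRAINT gold_users_pk PRIMARY KEY (user_id);
```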


Dharinip
New Contributor III

Hi,

To your question: the dataset is large and will keep growing. In my use case, the Silver layer contains a USER table (a streaming Delta Live Table) like the one below:

USER table:

batchid   | ID  | Name | startdate  | Enddate
akjfdhjsa | 123 | ABC  | 11/01/2024 | 11/02/2024
adjagfua  | 123 | ABC  | 11/02/2024 | 11/05/2024
sdgfrbas  | 123 | ABC  | 11/05/2024 | NULL
hgasdfcg  | 567 | XYZ  | 11/01/2024 | 11/05/2024
ahgfdhgb  | 567 | XYZ  | 11/05/2024 | NULL

In the Gold Layer we need a table like:

USER_ID | USER_NAME | USER_START_DATE | USER_END_DATE
123     | ABC       | 11/05/2024      | NULL
567     | XYZ       | 11/05/2024      | NULL

The transformations that have to be done in the Gold layer are:

1. Rename the columns in the dataset

2. Add the PK/FK constraints to the dataset

3. Remove a few columns, like batchid, from the dataset

4. Make the dataset SCD Type 1 in the Gold layer, i.e. bring in only the latest updated record for each user

5. Sometimes join with 1 or 2 tables to bring in new columns

The question here is:

Should the dataset be created as a streaming table or a materialized view?


ACCEPTED SOLUTION

madams
Contributor

Oh no!  I swear I wrote a reply out, but I must not have submitted it.  Setting that aside...

I think in your case a materialized view makes the most sense, and it's what I'd go with.  It's essentially precomputing the logic, but reusing some of the storage on disk.  Definitely look into liquid clustering to help with performance on your common joins.
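Pulling the thread together, here is a hedged sketch of that recommendation as a DLT materialized view with liquid clustering, using the column names from the tables above (`cluster_by` and all object names are assumptions for illustration):

```python
import dlt
from pyspark.sql import functions as F

# Batch read => DLT maintains this as a materialized view that is
# recomputed (incrementally where possible) on each pipeline update.
@dlt.table(
    name="gold_users",
    cluster_by=["USER_ID"],  # liquid clustering on the common join key
)
def gold_users():
    latest = dlt.read("silver_users").filter(F.col("Enddate").isNull())
    return latest.select(    # batchid is simply not selected, dropping it
        F.col("ID").alias("USER_ID"),
        F.col("Name").alias("USER_NAME"),
        F.col("startdate").alias("USER_START_DATE"),
        F.col("Enddate").alias("USER_END_DATE"),
    )
```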
