11-08-2024 12:04 PM
We have the following use case:
We receive raw data from an application, and it is ingested into the Iron layer. The raw data is in JSON format.
The Bronze layer is the first level of transformation; flattening the JSON file is that first transformation.
The Silver layer is the second level of transformation. The final tables are formed in this layer, but all of them are in SCD Type 2 format, so the historical changes are retained.
---Here comes the final question.
We need to create a dataset that contains only the latest updated records from the final tables. Should that be created as a table in the Gold layer, a view, or a materialized view?
11-19-2024 11:20 AM
Oh no! I swear I wrote a reply out, but I must not have submitted it. Setting that aside...
I think in your case a materialized view makes the most sense, and it's what I'd go with. It's essentially precomputing the logic, but reusing some of the storage on disk. Definitely look into liquid clustering to help with performance on your common joins.
11-08-2024 03:14 PM
Hi Dharinip. I've had similar conversations internally with developers asking very similar things. This is my general advice for this situation, but there are a lot of considerations in how you create your Gold layer.
Ultimately, I'd try it both ways and gauge performance, because there are potentially other factors I haven't even listed. The upside of a view is that the data is truly live once it lands in Silver: there's no further `MERGE` operation that needs to happen against the files on disk.
If you have answers to the above questions, I can try and formulate a more specific recommendation to your scenario!
11-10-2024 03:53 PM
Hi Madams,
Thank you for your reply.
Let me explain in detail about the use case that we have.
The raw data is ingested into the Iron layer via batch processing. The files are in JSON format and arrive in batches.
IRON Layer - The first level, where the data is ingested. Each batch file is treated as one record in the Delta Live Table.
BRONZE Layer - The first level of transformation happens here, which is flattening the JSON file into objects.
SILVER Layer - The actual Delta tables are created here, but all of them are SCD Type 2. The historical changes are also recorded with the data timestamp; the last updated record is the active record. For example, consider a user details table:
user_id | user_name | date_created | date_modified |
123 | ABC | 11/01/2024 | 11/02/2024 |
123 | ABC | 11/02/2024 | 11/05/2024 |
123 | ABC | 11/05/2024 | NULL |
456 | XYZ | 11/01/2024 | 11/05/2024 |
456 | XYZ | 11/05/2024 | NULL |
The user table above is created in the Silver layer as a streaming DLT table. The changes are recorded because the tables are SCD Type 2.
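To make the SCD Type 2 semantics concrete, here is a minimal sketch in plain Python of how the active record per user is identified, using the sample rows above (in the actual pipeline this would be a Spark SQL / DLT query; `None` stands in for NULL):

```python
# Sample SCD Type 2 rows from the Silver user table above.
# A NULL (None) date_modified marks the currently active record.
rows = [
    {"user_id": 123, "user_name": "ABC", "date_created": "11/01/2024", "date_modified": "11/02/2024"},
    {"user_id": 123, "user_name": "ABC", "date_created": "11/02/2024", "date_modified": "11/05/2024"},
    {"user_id": 123, "user_name": "ABC", "date_created": "11/05/2024", "date_modified": None},
    {"user_id": 456, "user_name": "XYZ", "date_created": "11/01/2024", "date_modified": "11/05/2024"},
    {"user_id": 456, "user_name": "XYZ", "date_created": "11/05/2024", "date_modified": None},
]

# SCD Type 2 -> latest only: keep the rows whose date_modified is still open,
# leaving exactly one active record per user_id.
active = [r for r in rows if r["date_modified"] is None]
```

This is the same filter the Gold-layer question below is about: whether that "active records only" result should be computed on read (a view) or precomputed (a table or materialized view).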
---GOLD Layer
I need help in deciding on the Gold layer. My requirements are:
To create a dataset like the following in the Gold layer:
123 | ABC | 11/05/2024 | NULL |
456 | XYZ | 11/05/2024 | NULL |
- The final user dataset in the Gold layer should contain only the last updated record for each user, with the user id as the primary key.
- Sometimes I would have to join with other tables to bring in some columns.
- The column names need to be changed.
- Default values will be applied to NULL columns.
These are pretty much the required transformations for the dataset in the Gold layer.
My questions are:
1. Should the dataset in the Gold layer be created as a streaming table or a materialized view? The data will keep growing, and these tables are relatively large datasets.
2. Some QC checks may be applied in the future, if not now.
3. We need to define primary key constraints on the datasets.
11-10-2024 07:42 PM
Hi,
To your question: the dataset is large and will keep growing. In my use case the Silver layer contains a USER table (a streaming Delta Live Table) like this:
USER table:
batchid | ID | Name | startdate | Enddate |
akjfdhjsa | 123 | ABC | 11/01/2024 | 11/02/2024 |
adjagfua | 123 | ABC | 11/02/2024 | 11/05/2024 |
sdgfrbas | 123 | ABC | 11/05/2024 | NULL |
hgasdfcg | 567 | XYZ | 11/01/2024 | 11/05/2024 |
ahgfdhgb | 567 | XYZ | 11/05/2024 | NULL |
In the Gold Layer we need a table like:
USER_ID | USER_NAME | USER_START_DATE | USER_END_DATE |
123 | ABC | 11/05/2024 | NULL |
567 | XYZ | 11/05/2024 | NULL |
The transformations that have to be done in the Gold layer are:
1. Renaming the columns of the dataset
2. Adding PK/FK constraints to the dataset
3. Removing a few columns, such as batchid, from the dataset
4. Making the dataset SCD Type 1 in the Gold layer, i.e. bringing in only each user's latest updated record
5. Sometimes joining with one or two tables to bring in new columns
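Put together, steps 1, 3, and 4 above can be sketched in plain Python against the sample USER rows (a sketch only: the real implementation would be a DLT materialized-view or streaming-table query, and the constraint and join steps are omitted here):

```python
# Sample Silver USER rows (SCD Type 2); Enddate of None (NULL) = active record.
silver_user = [
    {"batchid": "akjfdhjsa", "ID": 123, "Name": "ABC", "startdate": "11/01/2024", "Enddate": "11/02/2024"},
    {"batchid": "adjagfua",  "ID": 123, "Name": "ABC", "startdate": "11/02/2024", "Enddate": "11/05/2024"},
    {"batchid": "sdgfrbas",  "ID": 123, "Name": "ABC", "startdate": "11/05/2024", "Enddate": None},
    {"batchid": "hgasdfcg",  "ID": 567, "Name": "XYZ", "startdate": "11/01/2024", "Enddate": "11/05/2024"},
    {"batchid": "ahgfdhgb",  "ID": 567, "Name": "XYZ", "startdate": "11/05/2024", "Enddate": None},
]

def to_gold(rows):
    """SCD Type 2 -> SCD Type 1: keep active rows, drop batchid, rename columns."""
    gold = []
    for r in rows:
        if r["Enddate"] is not None:   # step 4: latest (active) record only
            continue
        gold.append({                  # steps 1 and 3: rename columns, drop batchid
            "USER_ID": r["ID"],
            "USER_NAME": r["Name"],
            "USER_START_DATE": r["startdate"],
            "USER_END_DATE": r["Enddate"],
        })
    return gold

gold_user = to_gold(silver_user)
# gold_user now holds one row per user, matching the Gold table shown above.
```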
The question here is:
Should the dataset be created as a streaming table or a materialized view?