Databricks Community

Dharinip · ‎11-08-2024

We have the following use case:

We receive raw form of data from an application and that is ingested in the Iron Layer. The raw data is in the JSON Format

The Bronze layer will the first level of transformation. The flattening of the JSON file happens to be the first level of transformation.

The Silver layer is the second level of transformation that takes place. The final tables are formed in this layer. But all the tables are in the SCD type 2 Format. The historical changes have been retained.

---Here comes the final question.

We need to create a dataset that contains only the latest updated records in the final table. Should that be created as a table in the Gold layer or a view or a materialized view?

DHARINI PARGUNAN

madams · ‎11-19-2024

Oh no! I swear I wrote a reply out, but I must not have submitted it. Setting that aside...

I think in your case a materialized view makes the most sense, and it's what I'd go with. It's essentially precomputing the logic, but reusing some of the storage on disk. Definitely look into liquid clustering to help with performance on your common joins.

View solution in original post

madams · ‎11-08-2024

Hi Dharinip. I've had similar conversations internally with developers asking very similar things. This is my general advice for this situation, but I think there are a lot of considerations to how you create your gold layer.

How large is this dataset, and how much growth is expected? The larger the dataset gets, the slower view queries could become. If it's very large, I'd recommend materializing it as a table.
Is the silver table a MANAGED table in Delta format, or is it in another format? If not in Delta, this could potentially slow down the performance a view.
Are you able to add liquid cluster on your soft delete flag? This would allow Spark to more easily fetch just the live records in the event that you use a view instead of a table
Are you able to further enhance the gold table with data from additional data sources? If so, you would likely want to materialize as a table, so that you don't have a large number of joins in a gold level view. Joins are a very expensive operation, as far as processing time.

Ultimately, I'd try it both ways and gauge performance, because there are potentially other factors that I didn't even list. The upside of a view is that this data is really live, once it gets into silver. There's no further `MERGE` operation that needs to happen to the files on disk.

If you have answers to the above questions, I can try and formulate a more specific recommendation to your scenario!

Dharinip · ‎11-10-2024

Hi Madams,

Thank you for your reply.

Let me explain in detail about the use case that we have.

The raw data is ingested in the Iron layer on batch processing. The file would be in JSON format that we receive in batches.

IRON Layer - The first level where the data is been ingested. So, each and every batch file will be treated as one record in the Delta live table.

BRONZE Layer - The first level of transformation happens here which is the flattening of the JSON file to the objects. The

SILVER Layer - The delta actual tables are getting created here but all the tables are SCD Type 2. The historical changes have also been recorded with the data timestamp. The last updated record would be active record. For example: Consider a user details table

user_id	user_name	date_created	date_modified
123	ABC	11/01/2024	11/02/2024
123	ABC	11/02/2024	11/05/2024
123	ABC	11/05/2024	NULL
456	XYZ	11/01/2024	11/05/2024
456	XYZ	11/05/2024	NULL

The user table above is created in the Silver layer as streaming DLT Tables. The changes made will be recorded since they are SCD Type 2.

---GOLD Layer

I need help in deciding the Gold layer. My requirements are:

To create a dataset like in the Gold layer.

123	ABC	11/05/2024	NULL
456	XYZ	11/05/2024	NULL

- The final user dataset in the gold layer should have only the last updated records of each user. The user id should be the primary key.

- Sometimes I would have to make a join with other tables to bring some columns.

- The column names needs to be changed.

- Default values will be applied in NULL columns.

These are pretty much the required transformations for the dataset to be in Gold layer.

My questions are:

1. Should the dataset in the GOLD layer be created with streaming tables/ Materialized Views? The data will be keep on growing and relatively these tables are large datasets.

2. some QC checks can be applied in the future if not now.

3. Need to define the primary key constraints on the datasets.

DHARINI PARGUNAN

Dharinip · ‎11-10-2024

Hi,

To your question, the dataset is large and will be growing as well. In my use case the Silver layer contains a USER table(streaming Delta live table) like below:

USER table:

batchid	ID	Name	startdate	Enddate
akjfdhjsa	123	ABC	11/01/2024	11/02/2024
adjagfua	123	ABC	11/02/2024	11/05/2024
sdgfrbas	123	ABC	11/05/2024	NULL
hgasdfcg	567	XYZ	11/01/2024	11/05/2024
ahgfdhgb	567	XYZ	11/05/2024	NULL

In the Gold Layer we need a table like:

USER_ID	USER_NAME	USER_START_DATE	USER_END_DATE
123	ABC	11/05/2024	NULL
567	XYZ	11/05/2024	NULL

The transformations that has to be done in Gold Layer are:

1. Renaming of columns to the dataset

2. To add the PK/FK constraints to the dataset

3. To remove few columns like batchid in the dataset

4. To make the dataset SCD TYPE 1 in the Gold layer i.e to bring only the latest updated records of the user

5. Sometimes to make join with 1 or 2 tables and bring new columns

The question here is:

Should the dataset be created as a streaming Table or a materialized view?

DHARINI PARGUNAN

madams · ‎11-19-2024

Oh no! I swear I wrote a reply out, but I must not have submitted it. Setting that aside...

I think in your case a materialized view makes the most sense, and it's what I'd go with. It's essentially precomputing the logic, but reusing some of the storage on disk. Definitely look into liquid clustering to help with performance on your common joins.