Data Engineering

DLT Live Table - Incremental Refresh

DataBricks_Use1
New Contributor

Hi All,

In our ETL framework we have four layers: Raw, Foundation, Trusted, and Unified. In Raw we copy files in JSON format from a source using an ADF pipeline. In the next layer (Foundation) we flatten the JSON files and convert them into Delta tables (only incremental data is appended here). In the Trusted layer we perform an UPSERT.

Incremental load is driven by the Modified_Date column present in the source tables. We have one parent table and multiple child tables; if a child row is updated, the Modified_Date is updated in the child table only, not in the parent table. In the Unified layer we have dimension and fact tables. To refresh them we wrote queries that join the required parent and child tables, with all the business logic in a single SQL statement. If a second fact table needs the same derived field, we have to repeat that logic in the second fact query. Because of this, we decided to create separate tables in the Trusted layer that hold all the derived columns, so the fact queries can reference the derived columns directly. I am not sure whether this is a good idea.

Please let me know. Also, for the derived tables we are exploring DLT, which refreshes automatically. A DLT table with full refresh works fine; now I want to make the refresh incremental. For example:

Parent table "Case"
Fields: Id, Case_Name, Modified_Date

Child table "Case Product"
Fields: Id, Product_Name, Modified_Date

Target (output)
Fields: Id, Product_Name (comma separated)

Please note that a change to a product name in the child table does not update the Modified_Date in the parent table.
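To make the expected output concrete, here is a minimal Python sketch of the aggregation (the sample rows, dates, and function name are illustrative assumptions, not part of the actual pipeline):

```python
from collections import defaultdict

# Child table "Case Product" rows: (Id, Product_Name, Modified_Date)
case_product_rows = [
    (1, "Widget", "2023-01-05"),
    (1, "Gadget", "2023-02-10"),
    (2, "Widget", "2023-01-07"),
]

def build_target(rows):
    """Group child rows by Id and emit the comma-separated Product_Name list."""
    grouped = defaultdict(list)
    for case_id, product_name, _modified in rows:
        grouped[case_id].append(product_name)
    return {case_id: ",".join(names) for case_id, names in grouped.items()}

target = build_target(case_product_rows)
# target[1] == "Widget,Gadget", target[2] == "Widget"
```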

Please help.

Thanks in Advance,

2 REPLIES 2

Anonymous
Not applicable

@DataBricks_User9:

It seems like you have a well-defined ETL framework with different layers for processing data. Regarding your question about creating separate tables in the Trusted Layer for derived columns, it can be a good idea as it allows you to centralize the logic for calculating derived columns in one place, which can make it easier to maintain and update the logic if needed. It also allows you to avoid repeating the same logic in multiple queries, which can improve query performance and reduce the risk of errors.

As for making the DLT table refresh incrementally, you can use the MERGE INTO statement in Delta Lake to perform an upsert driven by the Modified_Date column. Use the Modified_Date in the child table to determine which rows need to be updated or inserted in the target table. Here's an example SQL statement you can use as a starting point:

MERGE INTO target_table t
USING (
  -- Re-aggregate ALL products for any Id whose child rows changed since
  -- the last refresh; aggregating only the changed rows would drop
  -- unchanged products from the comma-separated list.
  SELECT c.Id,
         concat_ws(',', collect_list(c.Product_Name)) AS Product_Names
  FROM child_table c
  JOIN parent_table p ON c.Id = p.Id
  WHERE c.Id IN (
    SELECT Id FROM child_table
    -- The MERGE source cannot reference the target alias t, so keep the
    -- watermark (the last processed Modified_Date) somewhere queryable,
    -- e.g. a small one-row control table (watermark_table here is a
    -- placeholder name).
    WHERE Modified_Date > (SELECT last_refresh_ts FROM watermark_table)
  )
  GROUP BY c.Id
) s
ON t.Id = s.Id
WHEN MATCHED THEN
  UPDATE SET t.Product_Name = s.Product_Names
WHEN NOT MATCHED THEN
  INSERT (Id, Product_Name) VALUES (s.Id, s.Product_Names)

In this example, we join the child and parent tables on the Id column, restrict to Ids whose child rows have a Modified_Date newer than the last processed watermark, group by Id, and use collect_list with concat_ws to build the comma-separated Product_Name string. The MERGE INTO statement then updates or inserts rows in the target table by Id. Two details matter here: the MERGE source query cannot reference the target alias, so the watermark has to be stored separately, and all child rows for a changed Id must be re-aggregated so that unchanged products are not lost from the list.
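The watermark pattern described in this reply can be sketched in plain Python (the row layout, dates, and function name are illustrative assumptions; the real job would run this as Spark SQL):

```python
def incremental_refresh(child_rows, target, watermark):
    """Sketch of the incremental MERGE: rebuild the comma-separated product
    list, but only for Ids whose child rows changed after the watermark."""
    # child_rows: list of (Id, Product_Name, Modified_Date) tuples
    changed_ids = {cid for cid, _name, modified in child_rows
                   if modified > watermark}
    for cid in changed_ids:
        # Re-aggregate ALL rows for a changed Id so unchanged products
        # are not dropped from the list.
        names = [name for rid, name, _m in child_rows if rid == cid]
        target[cid] = ",".join(names)  # upsert into the target
    return target

child = [
    (1, "Widget", "2023-01-05"),
    (1, "Gadget", "2023-03-01"),   # updated after the watermark
    (2, "Widget", "2023-01-07"),   # unchanged
]
target = {1: "Widget", 2: "Widget"}
result = incremental_refresh(child, target, watermark="2023-02-01")
# Id 1 is rebuilt with both products; Id 2 is untouched.
```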

I hope this helps! Let me know if you have any other questions.

Anonymous
Not applicable

Hi @DataBricks_User9,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
