Hi @deepu1 ,
Assuming that @dlt.table refers to a Materialized View (MV), you are correct that this is the standard way to create aggregated tables in the Gold layer. A Materialized View is essentially a table that stores the results of a specific query.
Because of this, it is impossible to "ignore" certain rows in the source table or perform a partial update (e.g., only updating the current month) within a standard MV. Doing so would break the fundamental promise of a Materialized View: that the data consistently reflects the results of its defining query.
Regarding your attempt with AUTO CDC (formerly APPLY_CHANGES INTO😞 this feature is designed to update rows based on a primary key (UPSERT logic) rather than aggregating data. Therefore, it is not the appropriate tool for your Gold layer requirements.
To handle partial updates and ensure that late-arriving data from previous months does not change your historical Gold records, you must ensure that this data is filtered before or during the aggregation process. You can achieve this in two ways:
-
Filter at the Bronze/Silver Layer: Filter out late-arriving rows during the ingestion process. For example, if month(updated_timestamp) < month(current_date()), the row is considered late and is dropped. This is a good solution if you can safely ignore late-arriving data entirely.
-
Filter in the Gold Layer Definition: Introduce an current_date() AS ingestion_date column in your Bronze/Silver layer to track when a row was first processed. In your Gold MV definition, apply a filter: WHERE month(updated_timestamp) == month(ingestion_date). This ensures that if a row arrives in February but has a January timestamp, it is excluded from the aggregation.
In both scenarios, you can continue using Materialized Views for your Gold layer aggregations while ensuring that previous months remain unchanged. This approach aligns perfectly with the "declarative" nature of Lakeflow Spark Declarative Pipelines.
Hope it helps!