Window functions in Change Data Feed

mike_engineer — Thu, 05 Oct 2023 17:01:11 GMT

Hello!

I am currently exploring the possibility of implementing incremental changes in our company's ETL pipeline and looking into Change Data Feed option. There are a couple of challenges I'm uncertain about.

For instance, we have a piece of logic like this:

lag(is_available, 1, date '1970-01-01') over (partition by store, product_id order by is_available) as was_available_yesterday

Another case involves calculating the most popular quantity sold:

first_value(quantity) over (partition by sales_date, store, product order by count(*) desc) as quantity_mode

In both cases, I need to reference historical data. What would be the best approach to handle such scenarios? Should I create a separate table to store values for each unique combination of store and product, or should I join with a history table? I'm concerned about performance in the second case, as scanning the entire history table could be costly, especially considering that I have several fields that rely on historical context.

topic Window functions in Change Data Feed in Data Engineering

Window functions in Change Data Feed