Hello!
I am currently exploring the possibility of implementing incremental changes in our company's ETL pipeline and looking into Change Data Feed option. There are a couple of challenges I'm uncertain about.
For instance, we have a piece of logic like this:
lag(is_available, 1, date '1970-01-01') over (partition by store, product_id order by is_available) as was_available_yesterday
Another case involves calculating the most popular quantity sold:
first_value(quantity) over (partition by sales_date, store, product order by count(*) desc) as quantity_mode
In both cases, I need to reference historical data. What would be the best approach to handle such scenarios? Should I create a separate table to store values for each unique combination of store and product, or should I join with a history table? I'm concerned about performance in the second case, as scanning the entire history table could be costly, especially considering that I have several fields that rely on historical context.