cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Window functions in Change Data Feed

mike_engineer
New Contributor
Hello!

I am currently exploring the possibility of implementing incremental changes in our company's ETL pipeline and looking into Change Data Feed option. There are a couple of challenges I'm uncertain about.

For instance, we have a piece of logic like this:

 

lag(is_available, 1, date '1970-01-01') over (partition by store, product_id order by is_available) as was_available_yesterday

 


Another case involves calculating the most popular quantity sold:

 

first_value(quantity) over (partition by sales_date, store, product order by count(*) desc) as quantity_mode

 

In both cases, I need to reference historical data. What would be the best approach to handle such scenarios? Should I create a separate table to store values for each unique combination of store and product, or should I join with a history table? I'm concerned about performance in the second case, as scanning the entire history table could be costly, especially considering that I have several fields that rely on historical context.

1 REPLY 1

Kaniz
Community Manager
Community Manager

Hi @mike_engineer ,

- Use the Change Data Feed feature in Databricks to track row-level changes in a Delta table.
- Change Data Feed records change events for all data written into the table, including row data and metadata.


- Use case scenarios:
  1. Track is_available status changes for each store and product_id using Change Data Feed. This helps determine if a product was available the previous day without scanning the entire history. Code: lag(is_available, 1, date '1970-01-01') over (partition by store, product_id order by is_available) as was_available_yesterday
  2. Track changes in quantity for each unique combination of sales_datestore, and product using Change Data Feed. This helps calculate the most popular quantity sold without scanning the entire history. Code: first_value(quantity) over (partition by sales_date, store, product order by count(*) desc) as quantity_mode
- Avoid creating a separate table or joining a history table to improve performance.
- Use Change Data Feed to process changes and update calculations incrementally.
- Change Data Feed works with table history to provide change information.

Cloning a Delta table creates a separate history. Hence, changing the data feed on cloned tables doesn't match that of the original table.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.