Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Lakehouse sync tables over rolling history

leopold_cudzik
New Contributor II

Hi,
we're exploring replacing one of the use cases we run at our cloud provider with Databricks pipelines. So far we have explored subscribing to an Event Hub using SDP pipelines, feeding our IoT data into a Delta table where we keep the full history.
That full-history table is impractical for web applications because of high query latency. Our idea is to use a Lakehouse synced table; however, we don't want to sync the full history (too much data). The current idea is a dedicated Delta table filled by an SDP stream (so it holds IoT data in near real-time), which is then periodically cleaned so it only contains the last 7 days of data. That table would then be synced. There are all kinds of pitfalls: combining streaming ingestion with batch clean-up, syncing to the Lakehouse. Is this idea feasible? If not, how could these requirements be fulfilled?

1 ACCEPTED SOLUTION


Ashwin_DSA
Databricks Employee

Hi @leopold_cudzik,

The pattern you are suggesting is feasible, but it’s much easier to manage if you separate history ingestion from the 7-day serving view instead of cleaning the streaming sink table in place.

A common architecture on Databricks would look like this:

Bronze (full history, not synced): Event Hub > SDP stream > bronze.iot_events_history (append-only Delta). This is your long-term history for analytics/compliance.
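As a rough sketch, the bronze streaming table could be declared in SDP SQL via the Kafka-compatible endpoint that Event Hubs exposes (the namespace, event hub name, and auth options below are placeholders for your environment, not a tested configuration):

```sql
-- Sketch only: names and connection options are placeholders.
-- Event Hubs exposes a Kafka-compatible endpoint on port 9093.
CREATE OR REFRESH STREAMING TABLE bronze.iot_events_history AS
SELECT
  value,                    -- raw event payload; parse with from_json downstream
  timestamp AS ingest_time  -- broker timestamp from the Kafka record
FROM STREAM read_kafka(
  bootstrapServers => '<your-namespace>.servicebus.windows.net:9093',
  subscribe => '<your-eventhub-name>'
  -- plus the SASL/OAuth options required by your Event Hubs auth setup
);
```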

Silver last 7 days (synced): A second SDP pipeline (or streaming table) reads from bronze.iot_events_history and writes to silver.iot_events_last_7d, enforcing a 7-day window (via event time filter and/or watermarks). A simple scheduled job can periodically delete rows older than 7 days:

DELETE FROM silver.iot_events_last_7d
WHERE event_time < current_timestamp() - INTERVAL 7 DAYS;
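For completeness, the silver table itself could be declared along these lines (a sketch; the event_time column is assumed from your schema). Note that the streaming filter only applies when a row is first ingested, which is why the scheduled DELETE above is still needed to age rows out:

```sql
-- Sketch: assumes bronze rows carry an event_time column.
CREATE OR REFRESH STREAMING TABLE silver.iot_events_last_7d AS
SELECT *
FROM STREAM(bronze.iot_events_history)
-- Filters rows at ingest time; already-written rows age out via the DELETE job.
WHERE event_time >= current_timestamp() - INTERVAL 7 DAYS;
```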

Synced table for the app: Create a Lakehouse synced table from silver.iot_events_last_7d into Lakebase Postgres and point your web app to that Postgres table. When the 7-day cleanup runs on the Delta source, those deletions propagate through the sync, so the app only ever sees the rolling window.

This avoids most pitfalls. The ingestion stream stays append-only and simple, the 7-day logic lives in a derived table you control, and the Lakehouse synced table only has the small, recent slice you need for low-latency web queries.

Hopefully, the above answers your main question. However, if the primary system of record for the IoT data is operational (web apps) and analytics are secondary, you can flip the direction:

Ingest IoT directly into Lakebase Postgres (via EventHub consumer, Data API, or app code). Use Lakehouse Sync (Lakebase > Delta) to continuously replicate into Unity Catalog as lb_iot_events_history with full SCD2 history in the lakehouse. Keep only 7 days of data in Lakebase (periodic deletes), while the lakehouse table keeps full history. This pattern (subset in Lakebase, full history in the lakehouse) is an explicit Lakehouse Sync use case.
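In this flipped direction, the periodic Lakebase-side cleanup could be as simple as a scheduled Postgres statement (the table and column names here are assumed for illustration):

```sql
-- Sketch: run on a schedule (e.g. pg_cron or an external job) against Lakebase Postgres.
-- Deletes propagate nowhere destructive: the lakehouse copy keeps full SCD2 history.
DELETE FROM iot_events
WHERE event_time < now() - interval '7 days';
```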

This is best when Lakebase is your operational DB anyway and the lakehouse is downstream analytics.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

