
Custom Checkpointing

aliacovella
Contributor

The following is my scenario:

  • I need to query an external table on a daily basis; the table maintains a row version.
  • I'd like to query for all records whose row version is greater than the max row version previously processed.
  • The source is not a constant streaming source.

I've used Delta Live Tables with a streaming source and am familiar with how they checkpoint from a source such as Kinesis, but much of that is handled internally.

If I need to handle checkpointing myself, say by storing the last max date or row version processed and then using that as the starting point for the subsequent query, what is the recommended approach?

I've come across the idea of using a separate view to maintain the max value for a given table, which I can query to get the starting value for the next incremental batch. But I'm wondering whether there are better approaches, or whether I can somehow utilize the streaming table checkpoint mechanism for this.

 

1 ACCEPTED SOLUTION

VZLA
Databricks Employee

Hello, thank you for your question.

Since your source is not a continuous streaming source, you'll need to implement checkpointing yourself. The best approach is to store the maximum processed row version in a separate checkpoint table. Here's a simple strategy:

  1. Create a Checkpoint Table
    Maintain a table (e.g., checkpoint_table) with a single row containing the last processed row version.

  2. Retrieve the Last Processed Row Version
    Query the checkpoint_table before each run to get the last processed row version.

  3. Query New Records
    Fetch records where row_version > last_processed_version.

  4. Process the Data
    After processing, update checkpoint_table with the new max row_version from the processed data.

You can implement this in a Databricks notebook using Delta Lake tables to ensure ACID compliance. While Delta Live Tables (DLT) has built-in checkpointing for streaming sources, this manual checkpointing approach is necessary for batch ingestion. Let me know if you need more details on implementation!
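To make the four steps concrete, here is a minimal PySpark sketch as it might run in a Databricks notebook (where spark is the predefined SparkSession). All table and column names (ckpt.checkpoint_table, src.external_table, tgt.processed_records, row_version) are illustrative assumptions, not names from this thread:

from pyspark.sql import functions as F

CHECKPOINT_TABLE = "ckpt.checkpoint_table"  # assumed single-row Delta table
SOURCE_TABLE = "src.external_table"         # assumed external source table
TARGET_TABLE = "tgt.processed_records"      # assumed Delta target table

# Step 1: create the checkpoint table if it does not exist yet.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CHECKPOINT_TABLE} (
        last_processed_version BIGINT
    ) USING DELTA
""")

# Step 2: read the last processed row version; default to 0 on the first run.
row = spark.table(CHECKPOINT_TABLE).select(
    F.max("last_processed_version").alias("v")
).first()
last_version = row["v"] if row and row["v"] is not None else 0

# Step 3: fetch only the records newer than the checkpoint.
new_records = spark.table(SOURCE_TABLE).where(F.col("row_version") > last_version)

# Step 4: process the batch, then advance the checkpoint.
if not new_records.isEmpty():
    new_records.write.format("delta").mode("append").saveAsTable(TARGET_TABLE)
    new_max = new_records.agg(F.max("row_version")).first()[0]
    # Overwriting keeps the checkpoint table at a single row; the write is
    # an atomic Delta transaction.
    (spark.createDataFrame([(new_max,)], "last_processed_version BIGINT")
        .write.format("delta").mode("overwrite").saveAsTable(CHECKPOINT_TABLE))

One design caveat: the target write and the checkpoint update in step 4 are two separate transactions, so if the job fails between them the next run will reprocess the last batch. Making the target write idempotent (for example, a MERGE keyed on row_version rather than a blind append) protects against duplicates in that case.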


3 REPLIES


aliacovella
Contributor

Thanks, I had also considered the table storage approach, but wanted to see whether there was an internally defined mechanism I could use. Thanks for your timely response.

jeremy98
Contributor

Hi,

I agree with VZLA. My team ran into a similar issue, and since we don't have a streaming process on our side, we used a table to track the latest processed version of each source table. DLT pipelines could be an option too, but it also depends on how much freedom you want in how you use them.
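A sketch of that per-table variant, with one checkpoint row per source table, upserted via Delta's MERGE. The names (ckpt.table_versions, src.external_table) are again illustrative assumptions:

spark.sql("""
    CREATE TABLE IF NOT EXISTS ckpt.table_versions (
        table_name STRING,
        last_processed_version BIGINT
    ) USING DELTA
""")

def advance_checkpoint(table_name: str, new_max_version: int) -> None:
    """Record the new high-water mark for one source table after a successful batch."""
    spark.sql(f"""
        MERGE INTO ckpt.table_versions t
        USING (SELECT '{table_name}' AS table_name,
                      {new_max_version} AS last_processed_version) s
        ON t.table_name = s.table_name
        WHEN MATCHED THEN UPDATE SET
            t.last_processed_version = s.last_processed_version
        WHEN NOT MATCHED THEN INSERT *
    """)

# Example: after processing a batch whose max row_version was 42.
advance_checkpoint("src.external_table", 42)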
