3 weeks ago
The following is my scenario:
- I need to query on a daily basis from an external table that maintains a row version
- I would like to be able to query for all records where the row version is greater than the max previously processed row version.
- The source is not a constant streaming source
I've used Delta Live Tables with a streaming source and am familiar with how they checkpoint from sources such as Kinesis, but much of that is handled internally.
If I need to handle checkpointing myself, say by storing the last max date or row version processed and then using that as the starting point for the subsequent query, what is the recommended approach?
I've come across the idea of using a separate table to maintain the max value for a given source table, which I could query to get the starting value for the next incremental batch. But I'm wondering whether there are better approaches, or whether I can somehow utilize the streaming table checkpoint mechanism for this.
Accepted Solutions
3 weeks ago
Hello, thank you for your question.
Since your source is not a continuous streaming source, you’ll need to implement checkpointing manually. The best approach is to store the maximum processed row version in a separate checkpoint table. Here’s a simple strategy:
1. Create a checkpoint table: maintain a table (e.g., `checkpoint_table`) with a single row containing the last processed row version.
2. Retrieve the last processed row version: query `checkpoint_table` before each run.
3. Query new records: fetch records where `row_version > last_processed_version`.
4. Process the data: after processing, update `checkpoint_table` with the new max `row_version` from the processed data.
You can implement this in a Databricks notebook using Delta Lake tables to ensure ACID compliance. While Delta Live Tables (DLT) has built-in checkpointing for streaming sources, this manual checkpointing approach is necessary for batch ingestion. Let me know if you need more details on implementation!
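The four steps above can be sketched in plain Python. This is a minimal illustration of the pattern, not a Databricks implementation: an in-memory SQLite database stands in for the Delta tables, and all table and column names (`source_table`, `checkpoint_table`, `row_version`) are hypothetical. On Databricks you would issue the equivalent Spark SQL against Delta tables instead.

```python
import sqlite3

# In-memory SQLite stands in for Delta tables; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, payload TEXT, row_version INTEGER)")
conn.execute("CREATE TABLE checkpoint_table (table_name TEXT PRIMARY KEY, last_processed_version INTEGER)")
conn.execute("INSERT INTO checkpoint_table VALUES ('source_table', 0)")

def run_incremental_batch(conn):
    # Step 2: retrieve the last processed row version from the checkpoint table.
    (last_version,) = conn.execute(
        "SELECT last_processed_version FROM checkpoint_table WHERE table_name = 'source_table'"
    ).fetchone()
    # Step 3: fetch only records newer than the checkpoint.
    rows = conn.execute(
        "SELECT id, payload, row_version FROM source_table WHERE row_version > ?",
        (last_version,),
    ).fetchall()
    if not rows:
        return []
    # Step 4: after processing the batch (omitted here), advance the checkpoint
    # to the max row_version actually seen in this batch.
    new_max = max(r[2] for r in rows)
    conn.execute(
        "UPDATE checkpoint_table SET last_processed_version = ? WHERE table_name = 'source_table'",
        (new_max,),
    )
    conn.commit()
    return rows

# First run picks up versions 1 and 2; a rerun with no new data returns nothing.
conn.executemany("INSERT INTO source_table VALUES (?, ?, ?)", [(1, "a", 1), (2, "b", 2)])
batch1 = run_incremental_batch(conn)  # two rows
batch2 = run_incremental_batch(conn)  # empty: checkpoint already advanced to 2
```

On Databricks, updating the checkpoint and writing the processed output ideally happen close together (e.g., update the checkpoint only after the output write succeeds), so a failed run can safely re-read the same versions on retry.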
3 weeks ago
Thanks, I had also considered the table-storage approach but wanted to see whether there was a built-in mechanism I could use. Thanks for your timely response.
3 weeks ago
Hi,
I totally agree with VZLA. My internal team has a similar issue, and since we don't have a streaming process on our side, we used a table to track the latest processed version of each source table. DLT pipelines could be an option too, but it also depends on how much flexibility you want to retain.