
Custom Checkpointing

aliacovella
Contributor

The following is my scenario:

  • I need to query an external table on a daily basis; the table maintains a row version.
  • I'd like to query for all records whose row version is greater than the max row version previously processed.
  • The source is not a constant streaming source.

I've used Delta Live Tables with a streaming source and am familiar with how they checkpoint from a source such as Kinesis, but much of that is handled internally.

If I need to handle checkpointing myself, say by storing the last max date or row version processed and then using that as the starting point for the subsequent query, what is the recommended approach?

I've come across the idea of using a separate view to maintain the max value for a given table, which I can query to get the starting value for the next incremental batch. But I'm wondering whether there are better approaches, or whether I can somehow utilize the streaming table checkpoint mechanism for this.

 

1 ACCEPTED SOLUTION

VZLA
Databricks Employee

Hello, thank you for your question.

Since your source is not a continuous streaming source, you'll need to implement checkpointing yourself. The best approach is to store the maximum processed row version in a separate checkpoint table. Here's a simple strategy:

  1. Create a Checkpoint Table
    Maintain a table (e.g., checkpoint_table) with a single row containing the last processed row version.

  2. Retrieve the Last Processed Row Version
    Query the checkpoint_table before each run to get the last processed row version.

  3. Query New Records
    Fetch records where row_version > last_processed_version.

  4. Process the Data
    After processing, update checkpoint_table with the new max row_version from the processed data.

You can implement this in a Databricks notebook using Delta Lake tables to ensure ACID compliance. While Delta Live Tables (DLT) has built-in checkpointing for streaming sources, this manual checkpointing approach is necessary for batch ingestion. Let me know if you need more details on implementation!
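To make the four steps concrete, here is a minimal PySpark sketch as it might run in a Databricks notebook (where spark is the predefined SparkSession). All table and column names (ckpt.checkpoint_table, src.external_table, tgt.processed_records, row_version) are illustrative assumptions, not names from this thread:

from pyspark.sql import functions as F

CHECKPOINT_TABLE = "ckpt.checkpoint_table"  # assumed single-row Delta table
SOURCE_TABLE = "src.external_table"         # assumed external source table
TARGET_TABLE = "tgt.processed_records"      # assumed Delta target table

# Step 1: create the checkpoint table if it does not exist yet.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CHECKPOINT_TABLE} (
        last_processed_version BIGINT
    ) USING DELTA
""")

# Step 2: read the last processed row version; default to 0 on the first run.
row = spark.table(CHECKPOINT_TABLE).select(
    F.max("last_processed_version").alias("v")
).first()
last_version = row["v"] if row and row["v"] is not None else 0

# Step 3: fetch only the records newer than the checkpoint.
new_records = spark.table(SOURCE_TABLE).where(F.col("row_version") > last_version)

# Step 4: process the batch, then advance the checkpoint.
if not new_records.isEmpty():
    new_records.write.format("delta").mode("append").saveAsTable(TARGET_TABLE)
    new_max = new_records.agg(F.max("row_version")).first()[0]
    # Overwriting keeps the checkpoint table at a single row; the write is
    # an atomic Delta transaction.
    (spark.createDataFrame([(new_max,)], "last_processed_version BIGINT")
        .write.format("delta").mode("overwrite").saveAsTable(CHECKPOINT_TABLE))

One design caveat: the target write and the checkpoint update in step 4 are two separate transactions, so if the job fails between them the next run will reprocess the last batch. Making the target write idempotent (for example, a MERGE keyed on row_version rather than a blind append) protects against duplicates in that case.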


3 REPLIES


aliacovella
Contributor

Thanks, I had also considered the table storage approach, but wanted to see whether there was an internally defined mechanism I could use. Thanks for your timely response.

jeremy98
Contributor

Hi,

I agree with VZLA. My team ran into a similar issue, and since we don't have a streaming process on our side, we used a table to track the latest processed version of each source table. DLT pipelines could be an option too, but it also depends on how much freedom you want in how you use them.
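A sketch of that per-table variant, with one checkpoint row per source table, upserted via Delta's MERGE. The names (ckpt.table_versions, src.external_table) are again illustrative assumptions:

spark.sql("""
    CREATE TABLE IF NOT EXISTS ckpt.table_versions (
        table_name STRING,
        last_processed_version BIGINT
    ) USING DELTA
""")

def advance_checkpoint(table_name: str, new_max_version: int) -> None:
    """Record the new high-water mark for one source table after a successful batch."""
    spark.sql(f"""
        MERGE INTO ckpt.table_versions t
        USING (SELECT '{table_name}' AS table_name,
                      {new_max_version} AS last_processed_version) s
        ON t.table_name = s.table_name
        WHEN MATCHED THEN UPDATE SET
            t.last_processed_version = s.last_processed_version
        WHEN NOT MATCHED THEN INSERT *
    """)

# Example: after processing a batch whose max row_version was 42.
advance_checkpoint("src.external_table", 42)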
