Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Microbatching incremental updates Delta Live Tables

Erik_L
Contributor II

I need to create a workflow that pulls recent data from a database every two minutes, transforms that data in various ways, and appends the results to a final table. The problem is that some of these changes _might_ update existing rows in the final table, and I need to reconcile the differences, because only columns with new data should be updated. That is, data can sometimes arrive late for a given `event_time`. For example, `did_foo_value_exceed_n` should be updated when a foo comes in for an older `event_time`.

Anyway, I attempted to do this in Delta Live Tables. However, you cannot read from a downstream table to join and merge changes before applying CDC. I wrote a plain PySpark script that performs the merge with DeltaTable, but it can't be used together with a Delta Live Tables pipeline, because Workflows don't allow separate compute (Delta Live Tables compute vs. Workflow compute) to access the same tables, so I can't consume the result of the Delta Live Tables pipeline.
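
The core of that script looks roughly like this (a simplified sketch with placeholder names: `final_table`, `transformed_batch`, `id`, and `foo_value` are stand-ins for my actual schema):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Target is the final table; the source is the freshly transformed micro-batch.
target = DeltaTable.forName(spark, "final_table")
updates = spark.table("transformed_batch")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id AND t.event_time = s.event_time")
    # Update a column only when the incoming batch actually has a value for
    # it, so late-arriving data fills gaps without overwriting existing data.
    .whenMatchedUpdate(set={
        "foo_value": "coalesce(s.foo_value, t.foo_value)",
        "did_foo_value_exceed_n":
            "coalesce(s.did_foo_value_exceed_n, t.did_foo_value_exceed_n)",
    })
    .whenNotMatchedInsertAll()
    .execute()
)
```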

The biggest issue is that I can't use a triggered workflow, because acquiring compute takes longer than the pipeline itself takes to run. Is there any way to keep compute alive between Workflow runs?

2 REPLIES

jose_gonzalez
Databricks Employee

Hi @Erik_L ,

Just a friendly follow-up: have you had a chance to review my colleague's response? Was it helpful, or do you still need assistance? Your reply would be greatly appreciated.

Manisha_Jena
Databricks Employee

Hi @Erik_L,

As my colleague mentioned, a reliable way to keep compute alive between runs is to use a long-running Databricks Job rather than a triggered Workflow. The long-running job maintains a single ongoing Spark context, so the data transformations and merge can execute on every cycle without waiting for a cluster to spin up.
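
As a minimal sketch (assuming your merge logic is factored into a function, here a hypothetical `run_incremental_merge()`), a single long-running task can simply loop on the two-minute cadence, keeping the cluster warm between micro-batches:

```python
import time

# Hypothetical module holding your existing PySpark merge logic.
from my_pipeline import run_incremental_merge

INTERVAL_SECONDS = 120  # two-minute cadence

while True:
    started = time.time()
    run_incremental_merge()  # pull recent rows, transform, and merge
    # Sleep off whatever remains of the two-minute window.
    elapsed = time.time() - started
    time.sleep(max(0.0, INTERVAL_SECONDS - elapsed))
```

Because the Spark context never shuts down, each cycle only pays for the merge itself, not for cluster startup.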

Please let me know if it resolves the issue.
