
Microbatching incremental updates Delta Live Tables

Erik_L — Thu, 26 Oct 2023 17:15:53 GMT

I need to create a workflow that pulls recent data from a database every two minutes, transforms that data in various ways, and appends the results to a final table. The problem is that some of these results _might_ update existing rows in the final table, and I need to resolve the differences so that only columns with new data are updated. This happens because data for a specific `event_time` can arrive late. For example, `did_foo_value_exceed_n` should be updated when a foo comes in for an older `event_time`.

I attempted to do this in Delta Live Tables. However, you cannot pull from a future table to join and merge changes before applying a CDC. I wrote a plain PySpark script that performs the merge with `DeltaTable`, but it cannot be combined with a Delta Live Tables pipeline, because Workflows don't allow separate compute (Delta Live Tables compute vs Workflow compute) to access the same tables, so I can't take the result of the Delta Live Tables pipeline.
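For readers following along, a merge of the kind described above might look roughly like the sketch below. It is not the script from the post. It assumes that a NULL in an incoming column means "no new data for this column", so the existing value is kept via `coalesce`. The names `build_update_set`, `merge_late_updates`, `updates_df`, `target_table`, and `key_columns` are hypothetical; `DeltaTable.forName`, `merge`, `whenMatchedUpdate`, and `whenNotMatchedInsertAll` are from the Delta Lake Python API.

```python
def build_update_set(columns, key_columns, source_alias="s", target_alias="t"):
    """Map each non-key column to a SQL expression that takes the incoming
    value when it is non-NULL and otherwise keeps the existing value."""
    keys = set(key_columns)
    return {
        col: f"coalesce({source_alias}.{col}, {target_alias}.{col})"
        for col in columns
        if col not in keys
    }


def merge_late_updates(spark, updates_df, target_table, key_columns):
    """Upsert updates_df into target_table (assumption: NULL = no new data)."""
    # Imported here so build_update_set stays usable without Delta installed.
    from delta.tables import DeltaTable

    # Match rows on the key columns, e.g. "t.event_time = s.event_time".
    condition = " AND ".join(f"t.{k} = s.{k}" for k in key_columns)
    target = DeltaTable.forName(spark, target_table)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), condition)
        # Existing rows: overwrite only the columns that carry new data.
        .whenMatchedUpdate(set=build_update_set(updates_df.columns, key_columns))
        # New rows: insert as-is.
        .whenNotMatchedInsertAll()
        .execute()
    )
```

If late data should flip a flag such as `did_foo_value_exceed_n` only from false to true, the `coalesce` expression for that column would need to be replaced with something like a logical OR of the two sides; that depends on semantics the post does not spell out.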

The biggest issue is that I can't use a triggered workflow because the time to retrieve compute is longer than the time I need to run this pipeline. Is there any way I can keep compute between Workflow runs?

Re: Microbatching incremental updates Delta Live Tables

jose_gonzalez — Wed, 01 Nov 2023 17:19:37 GMT

Hi @Erik_L ,

Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.

Re: Microbatching incremental updates Delta Live Tables

Manisha_Jena — Fri, 10 Nov 2023 09:22:10 GMT

Hi @Erik_L,

As my colleague mentioned, to keep compute running between runs, a reliable strategy is to use a long-running Databricks Job instead of a triggered Databricks Workflow. The long-running job keeps its Spark context alive, so the data transformations and merge tasks can run every two minutes without waiting for new compute each time.
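One way to picture this is a single job whose entry point loops on a fixed cadence instead of exiting after each batch. The following is only a minimal sketch under that assumption, not an official Databricks pattern: `run_every` and `step` are hypothetical names, and `step` stands in for whatever pulls the recent rows and runs the merge.

```python
import time


def run_every(interval_seconds, step, max_runs=None,
              clock=time.monotonic, sleep=time.sleep):
    """Call step() on a fixed cadence inside one long-lived process.

    Because the process never exits between batches, the Spark session it
    holds stays available. Returns the number of completed runs.
    """
    runs = 0
    next_start = clock()
    while max_runs is None or runs < max_runs:
        step()
        runs += 1
        # Schedule relative to the planned start so the cadence does not drift.
        next_start += interval_seconds
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)
        else:
            # The batch overran its slot: start the next one immediately.
            next_start = clock()
    return runs
```

In the job itself this would be called as `run_every(120, step)` with no `max_runs`, so it runs until the job is cancelled. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.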

Please let me know if it resolves the issue.