Databricks Community

Erik_L · 10-26-2023

I need to create a workflow that pulls recent data from a database every two minutes, then transforms that data in various ways, and appends the results to a final table. The problem is that some of these changes _might_ update existing rows in the final table and I need to resolve the differences, because only columns with new data should be updated. That is, sometimes data can be delayed for a specific `event_time`. For example, `did_foo_value_exceed_n` should be updated when a foo comes in for an older `event_time`.

Anyway, I attempted to do this in Delta Live Tables. However, you cannot pull from a future table to join and merge changes before applying a CDC. I wrote a normal PySpark script that applies the merge with DeltaTable, but it cannot be combined with a Delta Live Tables pipeline: Workflows don't allow separate compute (Delta Live Tables compute vs. Workflow compute) to access the same tables, so the script can't consume the result of the Delta Live Tables pipeline.
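For readers following along, here is a minimal sketch of the kind of DeltaTable merge described above, where only columns that carry new data overwrite the existing row. It assumes that a NULL in the incoming batch means "no new data for this column", and the key and value column names passed in are hypothetical apart from `event_time` and `did_foo_value_exceed_n`. It is not the original script, only one way such a merge could be written.

```python
from typing import Dict, Iterable


def build_update_set(value_cols: Iterable[str], source: str = "s", target: str = "t") -> Dict[str, str]:
    """Map each column to a SQL expression that keeps the existing target
    value whenever the incoming source value is NULL (assumed to mean
    'no new data')."""
    return {c: f"coalesce({source}.{c}, {target}.{c})" for c in value_cols}


def merge_condition(key_cols: Iterable[str], source: str = "s", target: str = "t") -> str:
    """Build the join condition that matches incoming rows to existing rows."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in key_cols)


def upsert_partial(spark, updates_df, table_name, key_cols, value_cols):
    """Merge updates_df into the Delta table, updating only non-NULL columns
    on matched rows and inserting unmatched rows as they are."""
    # Imported lazily so the helpers above work without delta-spark installed.
    from delta.tables import DeltaTable

    (
        DeltaTable.forName(spark, table_name)
        .alias("t")
        .merge(updates_df.alias("s"), merge_condition(key_cols))
        .whenMatchedUpdate(set=build_update_set(value_cols))
        .whenNotMatchedInsertAll()
        .execute()
    )
```

A call might look like `upsert_partial(spark, batch_df, "final_table", ["device_id", "event_time"], ["did_foo_value_exceed_n"])`, where `final_table` and `device_id` are placeholder names. If NULL is a legitimate value in the data, the `coalesce` approach would need a different "has new data" signal.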

The biggest issue is that I can't use a triggered workflow because the time to retrieve compute is longer than the time I need to run this pipeline. Is there any way I can keep compute between Workflow runs?

jose_gonzalez · 11-01-2023

Hi @Erik_L ,

Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.

Manisha_Jena · 11-10-2023

Hi @Erik_L,

As my colleague mentioned, the reliable way to keep compute available between runs is to use a single long-running Databricks Job instead of a triggered Databricks Workflow. Because the job never exits, its Spark context stays alive, so each round of data transformation and merging can start immediately without waiting for new compute.
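As a rough illustration of that idea, the sketch below keeps one process alive and runs a step on a fixed two-minute cadence, so the cluster is acquired only once. The `run_every` helper and its parameters are hypothetical, not a Databricks API; `step` stands in for whatever pull, transform, and merge logic the job performs.

```python
import time
from typing import Callable, Optional


def run_every(
    step: Callable[[], None],
    interval_s: float = 120.0,
    max_runs: Optional[int] = None,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call step() repeatedly inside one long-lived process.

    Each iteration starts interval_s seconds after the previous one started,
    unless the step itself took longer, in which case the next one starts
    immediately. max_runs=None loops forever; clock and sleep are injectable
    so the loop can be exercised without real waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = clock()
        step()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # no need to wait after the final run
        remaining = interval_s - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return runs
```

Inside a job task this could be invoked as `run_every(my_pull_and_merge)` with `my_pull_and_merge` being your own function. The sketch does no error handling, so an exception in `step` ends the loop; whether to retry or let the job fail is a design choice left open here.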

Please let me know if it resolves the issue.