Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Asynchronous progress tracking with foreachbatch

Thor
New Contributor III

Hello,

Currently the docs say that async progress tracking is available only for the Kafka sink:
https://docs.databricks.com/en/structured-streaming/async-progress-checking.html

I would like to know whether it would work for any sink that is exactly-once. Let me explain: in many workflows, we read streaming data and merge the processed batch (the increment) into an external database (Azure SQL, Snowflake, etc.) using a MERGE statement to ensure idempotency. But while the merge runs, the Spark cluster sits idle even though it could already start processing the next batch. I think async progress tracking could address this, while the MERGE statement preserves exactly-once semantics. I don't see any impediment to this use case, except perhaps if the feature is restricted to the Kafka sink.
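The merge-for-idempotency idea above can be sketched without Spark. This is a minimal illustration only: a plain Python dict stands in for the external table, and `merge_batch` is a hypothetical helper mirroring what a keyed MERGE does inside a `foreachBatch` function — replaying a micro-batch after a retry leaves the target unchanged.

```python
# Sketch of the idempotent-merge pattern (not actual Spark/Databricks code).
# A dict keyed by primary key stands in for the external table.

def merge_batch(table: dict, batch: list, key: str = "id") -> None:
    """Upsert each row of a micro-batch into the target table.

    Replaying the same batch produces the same final state, which is
    what makes retried foreachBatch invocations safe (idempotent).
    """
    for row in batch:
        table[row[key]] = row  # insert new key or overwrite existing row

table = {}
batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
merge_batch(table, batch)
merge_batch(table, batch)  # replay after a retry: no duplicates
```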

1 ACCEPTED SOLUTION


cgrant
Databricks Employee

Asynchronous progress tracking is a feature designed for ultra-low-latency use cases. You can read more in the open-source SPIP doc here, but the expected time savings are in the hundreds of milliseconds, which is insignificant when doing merge operations against external systems.

Once Delta Live Tables (DLT) releases functionality to write to external databases, I recommend trying it. DLT should give you a pretty big gain in efficiency for this use case.
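For reference, here is what enabling the feature looks like on a Kafka-sink query. This is a sketch assuming the option names from the Spark/Databricks docs (`asyncProgressTrackingEnabled`, `asyncProgressTrackingCheckpointIntervalMs`); the servers, topic, and checkpoint path are placeholders.

```python
# Sketch only: enabling async progress tracking on a Kafka sink.
# All connection values below are placeholders, not real endpoints.
(df.writeStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "host:9092")       # placeholder
   .option("topic", "out-topic")                         # placeholder
   .option("checkpointLocation", "/tmp/checkpoints/q1")  # placeholder
   .option("asyncProgressTrackingEnabled", "true")
   # optional: how often progress is checkpointed asynchronously
   .option("asyncProgressTrackingCheckpointIntervalMs", "1000")
   .start())
```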


