Asynchronous progress tracking with foreachBatch

Thor — Tue, 26 Nov 2024 15:34:20 GMT

Hello,

Currently, the docs say that async progress tracking is available only for the Kafka sink:
https://docs.databricks.com/en/structured-streaming/async-progress-checking.html

I would like to know whether it would also work for any sink that is "exactly once".
To explain:
In many workflows, we read streamed data and merge each processed batch (the increment) into an external database (Azure SQL, Snowflake, etc.) using a merge statement to ensure idempotency. While the merge runs, the Spark cluster sits idle, even though it could already start processing the next batch. I think async progress tracking could address this, with the merge statement still ensuring "exactly once" semantics. I don't see any impediment to this use case, unless the feature is simply not allowed for sinks other than Kafka.
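To make the idempotency argument concrete, here is a minimal sketch of the merge step only. It makes several assumptions: SQLite (3.24 or later, for upsert support) stands in for the external database, the table `target` and the function `merge_batch` are hypothetical names, and plain Python lists stand in for the batch DataFrame. The callback takes the batch data and a batch id, the same shape a foreachBatch function receives.

```python
import sqlite3


def merge_batch(conn, rows, batch_id):
    """Upsert one micro-batch keyed on id (hypothetical helper)."""
    # One transaction per batch: either every row lands or none does.
    with conn:
        conn.executemany(
            """
            INSERT INTO target (id, value, last_batch_id)
            VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                value = excluded.value,
                last_batch_id = excluded.last_batch_id
            """,
            [(row_id, value, batch_id) for row_id, value in rows],
        )


# In-memory SQLite database standing in for Azure SQL / Snowflake (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT, last_batch_id INTEGER)"
)

batch_0 = [(1, "a"), (2, "b")]
batch_1 = [(2, "b2"), (3, "c")]

merge_batch(conn, batch_0, 0)
merge_batch(conn, batch_1, 1)
state_once = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()

# Simulate a restart that replays the last batch before it was recorded as done.
merge_batch(conn, batch_1, 1)
state_replayed = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()

print(state_once == state_replayed)
```

Replaying the same batch leaves the table unchanged, which is the property I am relying on. The sketch only covers a replay with identical batch contents; whether batch boundaries stay identical across a restart under async progress tracking is a separate question.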

Re: Asynchronous progress tracking with foreachBatch

cgrant — Tue, 26 Nov 2024 20:32:58 GMT

Asynchronous progress tracking is a feature designed for ultra-low-latency use cases. You can read more in the open source SPIP doc, but the expected time saving is in the hundreds of milliseconds per batch, which seems insignificant next to the cost of merge operations against external systems.

Once Delta Live Tables (DLT) releases functionality to write to external databases, I recommend trying it. DLT should give you a pretty big gain in efficiency for this use case.
