Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Asynchronous progress tracking with foreachbatch

Thor
New Contributor III

Hello,

Currently the docs say that async progress tracking is available only for the Kafka sink:
https://docs.databricks.com/en/structured-streaming/async-progress-checking.html

I would like to know whether it would work for any sink that is exactly-once. Let me explain: in many workflows, we read streaming data and merge the processed batch (the increment) into an external database (Azure SQL, Snowflake, etc.) using a MERGE statement to ensure idempotency. But while the merge runs, the Spark cluster sits idle even though it could already start processing the next batch. I think async progress tracking could address this, while the MERGE statement preserves exactly-once semantics. I don't see any impediment to this use case, except perhaps if the feature is restricted to the Kafka sink.
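The merge-for-idempotency idea above can be sketched without Spark. This is a minimal illustration only: a plain Python dict stands in for the external table, and `merge_batch` is a hypothetical helper mirroring what a keyed MERGE does inside a `foreachBatch` function — replaying a micro-batch after a retry leaves the target unchanged.

```python
# Sketch of the idempotent-merge pattern (not actual Spark/Databricks code).
# A dict keyed by primary key stands in for the external table.

def merge_batch(table: dict, batch: list, key: str = "id") -> None:
    """Upsert each row of a micro-batch into the target table.

    Replaying the same batch produces the same final state, which is
    what makes retried foreachBatch invocations safe (idempotent).
    """
    for row in batch:
        table[row[key]] = row  # insert new key or overwrite existing row

table = {}
batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
merge_batch(table, batch)
merge_batch(table, batch)  # replay after a retry: no duplicates
```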

1 ACCEPTED SOLUTION


cgrant
Databricks Employee

Asynchronous progress tracking is a feature designed for ultra-low-latency use cases. You can read more in the open-source SPIP doc here, but the expected time savings are in the hundreds of milliseconds, which is insignificant when doing merge operations against external systems.

Once Delta Live Tables (DLT) releases functionality to write to external databases, I recommend trying it. DLT should give you a pretty big gain in efficiency for this use case.
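For reference, here is what enabling the feature looks like on a Kafka-sink query. This is a sketch assuming the option names from the Spark/Databricks docs (`asyncProgressTrackingEnabled`, `asyncProgressTrackingCheckpointIntervalMs`); the servers, topic, and checkpoint path are placeholders.

```python
# Sketch only: enabling async progress tracking on a Kafka sink.
# All connection values below are placeholders, not real endpoints.
(df.writeStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "host:9092")       # placeholder
   .option("topic", "out-topic")                         # placeholder
   .option("checkpointLocation", "/tmp/checkpoints/q1")  # placeholder
   .option("asyncProgressTrackingEnabled", "true")
   # optional: how often progress is checkpointed asynchronously
   .option("asyncProgressTrackingCheckpointIntervalMs", "1000")
   .start())
```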


