<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Asynchronous progress tracking with foreachbatch in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/asynchronous-progress-tracking-with-foreachbatch/m-p/100121#M40193</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;The docs currently say that asynchronous progress tracking is available only for the Kafka sink:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/structured-streaming/async-progress-checking.html" target="_blank"&gt;https://docs.databricks.com/en/structured-streaming/async-progress-checking.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Would it also work for any sink that provides "exactly once" semantics?&lt;BR /&gt;To explain: in many workflows, we read streamed data and merge each processed batch (the increment) into an external database (Azure SQL, Snowflake, etc.), using a merge to ensure idempotency. While the merge runs, the Spark cluster sits idle, even though it could start processing the next batch. I think asynchronous progress tracking could address this, with the merge statement ensuring "exactly once" semantics. I don't see any impediment to this use case, unless the feature is blocked for sinks other than Kafka.&lt;/P&gt;</description>
    <pubDate>Tue, 26 Nov 2024 15:34:20 GMT</pubDate>
    <dc:creator>Thor</dc:creator>
    <dc:date>2024-11-26T15:34:20Z</dc:date>
    <item>
      <title>Asynchronous progress tracking with foreachbatch</title>
      <link>https://community.databricks.com/t5/data-engineering/asynchronous-progress-tracking-with-foreachbatch/m-p/100121#M40193</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;The docs currently say that asynchronous progress tracking is available only for the Kafka sink:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/structured-streaming/async-progress-checking.html" target="_blank"&gt;https://docs.databricks.com/en/structured-streaming/async-progress-checking.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Would it also work for any sink that provides "exactly once" semantics?&lt;BR /&gt;To explain: in many workflows, we read streamed data and merge each processed batch (the increment) into an external database (Azure SQL, Snowflake, etc.), using a merge to ensure idempotency. While the merge runs, the Spark cluster sits idle, even though it could start processing the next batch. I think asynchronous progress tracking could address this, with the merge statement ensuring "exactly once" semantics. I don't see any impediment to this use case, unless the feature is blocked for sinks other than Kafka.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 15:34:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/asynchronous-progress-tracking-with-foreachbatch/m-p/100121#M40193</guid>
      <dc:creator>Thor</dc:creator>
      <dc:date>2024-11-26T15:34:20Z</dc:date>
    </item>
    <item>
      <title>Re: Asynchronous progress tracking with foreachbatch</title>
      <link>https://community.databricks.com/t5/data-engineering/asynchronous-progress-tracking-with-foreachbatch/m-p/100158#M40204</link>
      <description>&lt;P&gt;Asynchronous progress tracking is a feature designed for ultra-low-latency use cases. You can read more in the open source SPIP doc &lt;A href="https://github.com/apache/spark/commit/e170a2eb236a376b036730b5d63371e753f1d947" target="_self"&gt;here&lt;/A&gt;, but the expected time savings are in the hundreds of milliseconds per batch, which seems insignificant next to merge operations against external systems.&lt;/P&gt;
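&lt;P&gt;To illustrate why a keyed merge tolerates a replayed micro-batch, here is a minimal sketch. It is only an assumption-laden illustration: the dict stands in for the external table, and the merge_batch helper and the sample rows are hypothetical, not part of any Spark or Databricks API.&lt;/P&gt;
&lt;PRE&gt;
```python
# Hypothetical in-memory stand-in for an external table keyed by "id".
def merge_batch(target, batch):
    """Upsert every row of `batch` into `target`, keyed by row["id"]."""
    for row in batch:
        # Insert the row if the key is new, otherwise overwrite it.
        target[row["id"]] = dict(row)
    return target


target = {}
batch_0 = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# First delivery of the micro-batch.
merge_batch(target, batch_0)
state_after_first_write = dict(target)

# Simulate a restart that replays the same micro-batch.
merge_batch(target, batch_0)

# A keyed upsert leaves the table unchanged on replay.
replay_is_safe = target == state_after_first_write
```
&lt;/PRE&gt;
&lt;P&gt;Because the upsert is keyed, writing the same batch twice ends in the same table state, which is the property the merge-based approach in the question relies on.&lt;/P&gt;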
&lt;P&gt;Once Delta Live Tables (DLT) releases functionality to write to external databases, I recommend trying it. DLT should give you a pretty big gain in efficiency for this use case.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2024 20:32:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/asynchronous-progress-tracking-with-foreachbatch/m-p/100158#M40204</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2024-11-26T20:32:58Z</dc:date>
    </item>
  </channel>
</rss>

