<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Streaming vs Batch with Continuous Trigger in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/streaming-vs-batch-with-continuous-trigger/m-p/113650#M44595</link>
    <description>Community discussion: Streaming vs Batch with Continuous Trigger in Data Engineering.</description>
    <pubDate>Wed, 26 Mar 2025 06:43:11 GMT</pubDate>
    <dc:creator>Ajay-Pandey</dc:creator>
    <dc:date>2025-03-26T06:43:11Z</dc:date>
    <item>
      <title>Streaming vs Batch with Continuous Trigger</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-vs-batch-with-continuous-trigger/m-p/113477#M44549</link>
      <description>&lt;P&gt;Not sure what the concrete advantage is for me in creating a streaming table vs. a static one. In my case, I designed a table with a job that extracts the latest files from an S3 location and then appends them to a Delta table. I set the job to run continuously. The change feed arrives approximately every minute in S3, and my job takes about 20 to 30 seconds to process the micro-batched feed. I keep an interactive cluster on all the time so as to minimize spin-up time.&lt;/P&gt;&lt;P&gt;If I were to switch to the AWS SNS and SQS services and then utilize Autoloader for the same table, what concrete advantage do I gain? It seems like SNS and SQS are the standard for streaming from AWS. But wouldn&amp;rsquo;t my process be sufficient? Every refresh (twice a minute) I use the partitioning of the S3 folders (year/month/day/timestamp.parquet) to extract only the latest files. Then spark.read.parquet() and bingo: the batch is quickly processed without any issues. I checkpoint every run so that I only filter for files arriving since the last refresh.&lt;/P&gt;&lt;P&gt;I also manage schema evolution with my own internal code, and I handle potential late-arriving data via a record-time column. It all works very well. So what is the advantage, and why should I abandon the micro-batching process with a continuous-trigger workflow in favor of readStream? Or are we essentially doing the same thing?&lt;/P&gt;</description>
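The manual micro-batch loop the post describes (list only the newest partition folders, filter against a checkpoint, then spark.read.parquet and append) might look roughly like the sketch below. This is a hypothetical reconstruction, not the poster's actual code; the helper names, paths, and table name are illustrative placeholders, and the Spark calls themselves are shown only as comments since they need a cluster.

```python
# Hypothetical sketch of the manual micro-batch loop described in the post.
# All paths and names are illustrative placeholders.
from datetime import datetime, timezone

def partition_prefix(ts: datetime) -> str:
    """Build the year/month/day S3 prefix layout the post describes."""
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

def files_since(listing: list[str], last_processed: str) -> list[str]:
    """Filter a folder listing down to files newer than the last checkpoint.

    Relies on the timestamp-named files sorting lexicographically, so a
    simple string comparison is enough to find "new" files.
    """
    return sorted(f for f in listing if f > last_processed)

# The Spark half of the loop (not executed here; requires a cluster):
#   prefix = partition_prefix(datetime.now(timezone.utc))
#   new_files = files_since(list_s3(prefix), read_checkpoint())
#   if new_files:
#       df = spark.read.parquet(*new_files)
#       df.write.format("delta").mode("append").saveAsTable("bronze.events")
#       write_checkpoint(max(new_files))
```

The string-comparison trick in files_since is what makes the timestamp-in-the-filename convention load-bearing: if file names ever stop sorting chronologically, the checkpoint filter silently drops or duplicates data.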
      <pubDate>Tue, 25 Mar 2025 08:47:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-vs-batch-with-continuous-trigger/m-p/113477#M44549</guid>
      <dc:creator>ashap551</dc:creator>
      <dc:date>2025-03-25T08:47:26Z</dc:date>
    </item>
    <item>
      <title>Re: Streaming vs Batch with Continuous Trigger</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-vs-batch-with-continuous-trigger/m-p/113650#M44595</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/132713"&gt;@ashap551&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You're essentially implementing a well-optimized micro-batching process, and functionally it's very similar to what readStream() with Autoloader would do. However, there are some advantages to Autoloader and a proper streaming table that might be worth considering.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Concrete Advantages of Streaming Tables &amp;amp; Autoloader in Your Case&lt;/STRONG&gt;&lt;/H3&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Scalability &amp;amp; Efficiency&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Your current approach works well because you control partitioning and file listing manually, but as the number of files grows, spark.read.parquet() may degrade due to listing overhead.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Autoloader&lt;/STRONG&gt; eliminates explicit listing via its file notification mode (backed by AWS SNS/SQS), reducing metadata operations.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;No Need for Explicit Checkpointing &amp;amp; File Filtering&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Right now, you're manually tracking the last processed file via checkpoints.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;With readStream(), Autoloader automatically tracks processed files and provides exactly-once ingestion without requiring explicit filtering logic.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;True Streaming vs. Continuous Micro-Batching&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Even though your workflow runs every ~30 seconds, there's still a small gap where data waits to be processed.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A continuously running readStream() (the default trigger or a short &lt;STRONG&gt;Trigger.ProcessingTime&lt;/STRONG&gt; interval) can reduce end-to-end latency, since Spark picks up new files as they land instead of waiting for the next scheduled run; &lt;STRONG&gt;Trigger.AvailableNow&lt;/STRONG&gt; remains an option if you later prefer a batch-style schedule.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Automatic Schema Evolution&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;You mentioned handling schema evolution manually. Autoloader can simplify this with its schema inference and evolution options (cloudFiles.schemaLocation and cloudFiles.schemaEvolutionMode), combined with Delta's mergeSchema on write.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Easier Integration with Delta Change Data Feed (CDF)&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;If your use case evolves and you need CDF, streaming tables integrate more naturally with readStream() and writeStream().&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Cost Optimization with Serverless Compute (Future Consideration)&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Since you're keeping an interactive cluster running at all times, this can be expensive. With streaming tables, you could potentially move to &lt;STRONG&gt;serverless compute&lt;/STRONG&gt; or a Photon-enabled jobs cluster, reducing costs.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;&lt;STRONG&gt;Should You Switch?&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;If your current method is working well, &lt;STRONG&gt;there's no immediate need to switch&lt;/STRONG&gt;. However, if you're expecting &lt;STRONG&gt;higher data volumes&lt;/STRONG&gt;, &lt;STRONG&gt;schema changes&lt;/STRONG&gt;, or &lt;STRONG&gt;lower-latency requirements&lt;/STRONG&gt;, Autoloader and readStream() would provide more efficiency and automation.&lt;/P&gt;</description>
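The Autoloader setup the reply recommends can be sketched as follows. This is a minimal illustration under assumptions, not the responder's code: the bucket, schema location, checkpoint path, and table name are placeholders, and the stream itself is shown only in comments since it requires a Databricks cluster. The cloudFiles option names (format, schemaLocation, schemaEvolutionMode, useNotifications) are real Auto Loader options; only the helper function is hypothetical.

```python
# Hypothetical Auto Loader configuration for the same S3-to-Delta pipeline.
# Paths and table names are illustrative placeholders.
def autoloader_options(use_notifications: bool) -> dict:
    """Assemble cloudFiles options; notification mode is backed by SNS/SQS on AWS."""
    opts = {
        "cloudFiles.format": "parquet",
        # Auto Loader persists the inferred schema here and evolves it over time:
        "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/events",
        # New columns are added to the schema instead of failing the stream:
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }
    if use_notifications:
        # Consume S3 event notifications instead of listing the directory.
        opts["cloudFiles.useNotifications"] = "true"
    return opts

# On a cluster, the stream itself would be roughly:
#   (spark.readStream.format("cloudFiles")
#        .options(**autoloader_options(use_notifications=True))
#        .load("s3://my-bucket/events/")
#        .writeStream
#        .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
#        .trigger(processingTime="30 seconds")   # or availableNow=True for batch-style runs
#        .toTable("bronze.events"))
```

Note how the checkpointLocation option replaces the hand-rolled last-file checkpoint from the original workflow: Spark records which files have been ingested, which is what provides the exactly-once behavior mentioned above.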
      <pubDate>Wed, 26 Mar 2025 06:43:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-vs-batch-with-continuous-trigger/m-p/113650#M44595</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2025-03-26T06:43:11Z</dc:date>
    </item>
  </channel>
</rss>

