<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Autoloader with availableNow=True and overwrite mode removes data in second micro-batch (DBR 16.3) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-with-availablenow-true-and-overwrite-mode-removes/m-p/123619#M47041</link>
    <description>&lt;P&gt;&lt;BR /&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I'm encountering an issue after upgrading to&amp;nbsp;Databricks Runtime 16.3, while using&amp;nbsp;Autoloader&amp;nbsp;with the following configuration:&lt;/P&gt;&lt;P&gt;trigger(availableNow=True)&lt;BR /&gt;outputMode("overwrite")&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;When a new file arrives, Autoloader processes it and writes the data to a Delta table. However, I consistently observe&amp;nbsp;two micro-batches&amp;nbsp;being triggered:&lt;/P&gt;&lt;P&gt;First micro-batch&amp;nbsp;ingests the file and writes data to the Delta table.&lt;BR /&gt;Second micro-batch, triggered just a few seconds later, finds no new files and still executes in overwrite mode — which ends up&amp;nbsp;removing the previously written data.&lt;BR /&gt;This behavior is confirmed in the Delta table history:&lt;/P&gt;&lt;P&gt;Version 422: file ingested, 3848 rows written.&lt;BR /&gt;Version 423: file removed, no new data written.&lt;BR /&gt;Checkpoint directory also shows commit files&amp;nbsp;422&amp;nbsp;and&amp;nbsp;423, confirming two micro-batches.&lt;/P&gt;&lt;P&gt;This issue&amp;nbsp;started occurring after we upgraded to DBR 16.3. Prior to that, the overwrite behavior was stable and did not remove data unexpectedly.&lt;/P&gt;&lt;P&gt;Has anyone else encountered this issue? Is there a recommended way to&amp;nbsp;prevent empty micro-batches from overwriting the table?&lt;/P&gt;&lt;P&gt;Any guidance or best practices would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
    <pubDate>Wed, 02 Jul 2025 09:47:30 GMT</pubDate>
    <dc:creator>divyansh8989</dc:creator>
    <dc:date>2025-07-02T09:47:30Z</dc:date>
    <item>
      <title>Autoloader with availableNow=True and overwrite mode removes data in second micro-batch (DBR 16.3)</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-with-availablenow-true-and-overwrite-mode-removes/m-p/123619#M47041</link>
      <description>&lt;P&gt;&lt;BR /&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I'm encountering an issue after upgrading to&amp;nbsp;Databricks Runtime 16.3, while using&amp;nbsp;Autoloader&amp;nbsp;with the following configuration:&lt;/P&gt;&lt;P&gt;trigger(availableNow=True)&lt;BR /&gt;outputMode("overwrite")&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;When a new file arrives, Autoloader processes it and writes the data to a Delta table. However, I consistently observe&amp;nbsp;two micro-batches&amp;nbsp;being triggered:&lt;/P&gt;&lt;P&gt;First micro-batch&amp;nbsp;ingests the file and writes data to the Delta table.&lt;BR /&gt;Second micro-batch, triggered just a few seconds later, finds no new files and still executes in overwrite mode — which ends up&amp;nbsp;removing the previously written data.&lt;BR /&gt;This behavior is confirmed in the Delta table history:&lt;/P&gt;&lt;P&gt;Version 422: file ingested, 3848 rows written.&lt;BR /&gt;Version 423: file removed, no new data written.&lt;BR /&gt;Checkpoint directory also shows commit files&amp;nbsp;422&amp;nbsp;and&amp;nbsp;423, confirming two micro-batches.&lt;/P&gt;&lt;P&gt;This issue&amp;nbsp;started occurring after we upgraded to DBR 16.3. Prior to that, the overwrite behavior was stable and did not remove data unexpectedly.&lt;/P&gt;&lt;P&gt;Has anyone else encountered this issue? Is there a recommended way to&amp;nbsp;prevent empty micro-batches from overwriting the table?&lt;/P&gt;&lt;P&gt;Any guidance or best practices would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jul 2025 09:47:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-with-availablenow-true-and-overwrite-mode-removes/m-p/123619#M47041</guid>
      <dc:creator>divyansh8989</dc:creator>
      <dc:date>2025-07-02T09:47:30Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader with availableNow=True and overwrite mode removes data in second micro-batch (DBR 16.</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-with-availablenow-true-and-overwrite-mode-removes/m-p/123686#M47053</link>
      <description>&lt;DIV class=""&gt;&lt;P&gt;You appear to have hit a behavioral change or subtle interaction in Databricks Runtime 16.3 involving Autoloader, trigger(availableNow=True), and outputMode("overwrite"). This specific combination seems to be causing an unexpected second micro-batch that overwrites the data.&lt;/P&gt;&lt;P&gt;Here's a breakdown of why this might be happening and what you can do:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Understanding the Behavior&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;trigger(availableNow=True)&lt;/STRONG&gt;: This trigger processes all data that is available when the query starts, possibly across several micro-batches, and then stops. It's designed for scheduled, batch-style runs, so more than one micro-batch per run is not unusual in itself.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;outputMode("overwrite")&lt;/STRONG&gt;: Structured Streaming's documented output modes are append, complete, and update, so "overwrite" here presumably refers to a write that replaces the entire Delta table with the data from the current micro-batch (for example, mode("overwrite") inside foreachBatch).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;The Problem in DBR 16.3&lt;/STRONG&gt;: It appears that in DBR 16.3, even after the initial batch processes the new file and writes data, a subsequent &lt;I&gt;empty&lt;/I&gt; micro-batch is being triggered very quickly. 
Because outputMode("overwrite") is set, this empty batch then overwrites the table, effectively deleting the data that was just written.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This wasn't the typical behavior in earlier DBR versions, where availableNow=True combined with overwrite would usually result in one write and then a graceful termination if no more data was found.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Potential Causes/Hypotheses:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Change in availableNow Trigger Logic:&lt;/STRONG&gt; There might be a subtle change in how availableNow interacts with the internal state management in DBR 16.3, leading to an extra "empty" micro-batch check and subsequent overwrite.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Internal File Discovery/Checkpointing Changes:&lt;/STRONG&gt; Autoloader uses a checkpoint location to track processed files. It's possible that the way file discovery or checkpointing is handled in DBR 16.3, especially with availableNow, is leading to a quick second check that reports no new files but still initiates a write operation due to the overwrite mode.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Optimization or Bug:&lt;/STRONG&gt; It could be an unintended consequence of an optimization or a bug introduced in DBR 16.3 for specific scenarios involving availableNow and overwrite.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;STRONG&gt;Recommended Solutions and Best Practices:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Given the observed behavior, here's how you can prevent the data loss:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Change outputMode("overwrite") to outputMode("append"):&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Best Practice for Incremental Ingestion:&lt;/STRONG&gt; For streaming or incremental ingestion with Autoloader, append is almost always the correct outputMode. 
This ensures that new data is added to the table without affecting existing data.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why Overwrite is Risky Here:&lt;/STRONG&gt; overwrite is generally used when you want to completely replace the table's contents, often in a full batch load scenario where you know the source will provide the complete dataset. For an incremental stream, it's problematic if an empty batch overwrites the entire table.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Use foreachBatch for Upserts/Merges:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;If your goal is to handle updates or deduplication (i.e., you might receive the same file again, or updated records), you should use outputMode("append") in conjunction with foreachBatch and perform a MERGE INTO operation.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;This gives you granular control over how each micro-batch's data is integrated into the Delta table.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Example (PySpark):&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;Python&lt;/SPAN&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; delta.tables &lt;SPAN class=""&gt;import&lt;/SPAN&gt; DeltaTable
&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN class=""&gt;import&lt;/SPAN&gt; SparkSession

&lt;SPAN class=""&gt;# Initialize SparkSession (if not already done)&lt;/SPAN&gt;
spark = SparkSession.builder.appName(&lt;SPAN class=""&gt;"AutoloaderMerge"&lt;/SPAN&gt;).getOrCreate()

&lt;SPAN class=""&gt;# Define your source path and checkpoint location&lt;/SPAN&gt;
source_path = &lt;SPAN class=""&gt;"abfss://your-container@your-storage-account.dfs.core.windows.net/input/"&lt;/SPAN&gt;
delta_table_path = &lt;SPAN class=""&gt;"abfss://your-container@your-storage-account.dfs.core.windows.net/delta_table/"&lt;/SPAN&gt;
checkpoint_location = &lt;SPAN class=""&gt;"abfss://your-container@your-storage-account.dfs.core.windows.net/checkpoint/"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;upsert_to_delta&lt;/SPAN&gt;(&lt;SPAN class=""&gt;microBatchDF, batchId&lt;/SPAN&gt;):&lt;/SPAN&gt;
    &lt;SPAN class=""&gt;# Create DeltaTable object if it doesn't exist&lt;/SPAN&gt;
    &lt;SPAN class=""&gt;if&lt;/SPAN&gt; &lt;SPAN class=""&gt;not&lt;/SPAN&gt; DeltaTable.isDeltaTable(spark, delta_table_path):
        microBatchDF.write.&lt;SPAN class=""&gt;format&lt;/SPAN&gt;(&lt;SPAN class=""&gt;"delta"&lt;/SPAN&gt;).mode(&lt;SPAN class=""&gt;"append"&lt;/SPAN&gt;).save(delta_table_path)
        print(&lt;SPAN class=""&gt;f"Batch &lt;SPAN class=""&gt;{batchId}&lt;/SPAN&gt;: Created initial Delta table."&lt;/SPAN&gt;)
    &lt;SPAN class=""&gt;else&lt;/SPAN&gt;:
        deltaTable = DeltaTable.forPath(spark, delta_table_path)
        &lt;SPAN class=""&gt;# Perform merge operation&lt;/SPAN&gt;
        deltaTable.alias(&lt;SPAN class=""&gt;"target"&lt;/SPAN&gt;) \
            .merge(
                microBatchDF.alias(&lt;SPAN class=""&gt;"source"&lt;/SPAN&gt;),
                &lt;SPAN class=""&gt;"target.id = source.id"&lt;/SPAN&gt; &lt;SPAN class=""&gt;# Replace 'id' with your actual primary key(s)&lt;/SPAN&gt;
            ) \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()
        print(&lt;SPAN class=""&gt;f"Batch &lt;SPAN class=""&gt;{batchId}&lt;/SPAN&gt;: Merged &lt;SPAN class=""&gt;{microBatchDF.count()}&lt;/SPAN&gt; rows into Delta table."&lt;/SPAN&gt;)
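
# Hedged alternative, not part of the original reply: if the table really must
# be replaced whenever a new file arrives, a foreachBatch handler can skip
# empty micro-batches so they never overwrite existing data. The name
# make_overwrite_handler is hypothetical; DataFrame.isEmpty() needs Spark 3.3+.
# Caveat: if one run produces several non-empty micro-batches, only the last
# one survives, because each overwrite replaces the previous contents.
def make_overwrite_handler(target_path):
    def overwrite_if_not_empty(microBatchDF, batchId):
        # An empty micro-batch carries no new data, so leave the table alone
        if microBatchDF.isEmpty():
            print(f"Batch {batchId}: empty micro-batch, skipping overwrite.")
            return
        # Replace the table contents with this micro-batch only
        microBatchDF.write.format("delta").mode("overwrite").save(target_path)
    return overwrite_if_not_empty

# To try it, pass make_overwrite_handler(delta_table_path) to foreachBatch
# below instead of upsert_to_delta.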

(spark.readStream
    .&lt;SPAN class=""&gt;format&lt;/SPAN&gt;(&lt;SPAN class=""&gt;"cloudFiles"&lt;/SPAN&gt;)
    .option(&lt;SPAN class=""&gt;"cloudFiles.format"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"csv"&lt;/SPAN&gt;) &lt;SPAN class=""&gt;# Or your file format&lt;/SPAN&gt;
    .option(&lt;SPAN class=""&gt;"cloudFiles.schemaLocation"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;f"&lt;SPAN class=""&gt;{checkpoint_location}&lt;/SPAN&gt;/schema"&lt;/SPAN&gt;)
    .option(&lt;SPAN class=""&gt;"cloudFiles.inferColumnTypes"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"true"&lt;/SPAN&gt;)
    .load(source_path)
    .writeStream
    .option(&lt;SPAN class=""&gt;"checkpointLocation"&lt;/SPAN&gt;, checkpoint_location)
    .foreachBatch(upsert_to_delta)
    .trigger(availableNow=&lt;SPAN class=""&gt;True&lt;/SPAN&gt;)
    .start()
    .awaitTermination())&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Explanation:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;foreachBatch(upsert_to_delta): This will call your upsert_to_delta function for each micro-batch.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Inside upsert_to_delta:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;It checks if the Delta table exists. If not, it creates it with mode("append") for the initial load.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;If the table exists, it performs a MERGE INTO operation. You'll need to define your merge condition based on your table's primary key(s). This ensures that existing records are updated and new records are inserted.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Verify cloudFiles.allowOverwrites (and use with caution):&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The cloudFiles.allowOverwrites option (default false) controls whether Autoloader processes files again if they are appended to or overwritten in the source.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Important:&lt;/STRONG&gt; Setting cloudFiles.allowOverwrites to true (which is not typically recommended unless you manage duplicates downstream) might lead to reprocessing of files. Even then, it won't directly solve the issue of an &lt;I&gt;empty&lt;/I&gt; second micro-batch overwriting everything. It's more about how Autoloader re-ingests source files that have changed.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Report to Databricks Support:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Since this behavior started specifically after upgrading to DBR 16.3 and was not present before, it's worth opening a support ticket with Databricks. Provide them with your exact configuration, DBR version, and the observed Delta table history. 
They can confirm if it's a known regression or a new intended (but perhaps problematic for your use case) behavior.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;STRONG&gt;Why outputMode("overwrite") with availableNow=True is tricky for incremental loads:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When you use trigger(availableNow=True), the expectation is often that it processes everything &lt;I&gt;once&lt;/I&gt; and then finishes. Combining this with outputMode("overwrite") means that &lt;I&gt;each&lt;/I&gt; time this stream runs (whether it finds new data or not, if a subsequent "empty" batch is triggered), it will completely replace the target table. If there's no new data in a subsequent batch, the table effectively becomes empty.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;In summary, the most robust and recommended solution for your scenario is to switch from outputMode("overwrite") to using foreachBatch with MERGE INTO for idempotent updates/inserts, or simply outputMode("append") if you only expect new records and don't need to handle updates/deduplication at the Delta table level.&lt;/STRONG&gt;&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 02 Jul 2025 13:06:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-with-availablenow-true-and-overwrite-mode-removes/m-p/123686#M47053</guid>
      <dc:creator>ashesharyak</dc:creator>
      <dc:date>2025-07-02T13:06:50Z</dc:date>
    </item>
  </channel>
</rss>

