<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Adding a new column triggers reprocessing of Auto Loader source table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64676#M32625</link>
    <description>&lt;P&gt;OK, so how does schema evolution relate to .option("mergeSchema", "true") then? Are they different things? Do they step on each other's toes?&lt;/P&gt;&lt;P&gt;If I only make non-breaking changes to the schema (just adding columns), am I to understand that I can simply remove the .option("mergeSchema", "true")?&lt;/P&gt;</description>
    <pubDate>Tue, 26 Mar 2024 15:27:12 GMT</pubDate>
    <dc:creator>cosminsanda</dc:creator>
    <dc:date>2024-03-26T15:27:12Z</dc:date>
    <item>
      <title>Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64535#M32599</link>
      <description>&lt;P&gt;I have a source table A in Unity Catalog. This table is constantly written to and is a streaming table.&lt;BR /&gt;I also have another table B in Unity Catalog. This is a managed table with liquid clustering.&lt;/P&gt;&lt;P&gt;Using Auto Loader, I move new data from A to B with code similar to the following:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;streaming_query = (&lt;BR /&gt;    spark.readStream.option("ignoreDeletes", "true")&lt;BR /&gt;    .table("catalog.bronze.A")&lt;BR /&gt;    .selectExpr(&lt;BR /&gt;        "column_1",&lt;BR /&gt;        "column_2"&lt;BR /&gt;    )&lt;BR /&gt;    .writeStream.option("checkpointLocation", "/Volumes/catalog/silver/checkpoints/B")&lt;BR /&gt;    .option("mergeSchema", "true")&lt;BR /&gt;    .trigger(availableNow=True)&lt;BR /&gt;    .toTable("B")&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;streaming_query.awaitTermination(timeout=300)&lt;/PRE&gt;&lt;P&gt;Everything was running smoothly, but I then decided to add a new column to the SELECT, so I ended up with a SELECT similar to:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;.selectExpr(&lt;BR /&gt;    "column_1",&lt;BR /&gt;    "column_2",&lt;BR /&gt;    "column_3"&lt;BR /&gt;)&lt;/PRE&gt;&lt;/DIV&gt;&lt;P&gt;After this configuration change, all the existing data in B got duplicated. The new column was added correctly to the duplicated records. It is as if the whole A table was reprocessed and appended to the already existing data in B.&lt;/P&gt;&lt;P&gt;Is this the expected behaviour based on my configuration? What can I do to avoid this in the future and just have the old data have NULL in the added columns?&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 25 Mar 2024 16:19:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64535#M32599</guid>
      <dc:creator>cosminsanda</dc:creator>
      <dc:date>2024-03-25T16:19:28Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64676#M32625</link>
      <description>&lt;P&gt;OK, so how does schema evolution relate to .option("mergeSchema", "true") then? Are they different things? Do they step on each other's toes?&lt;/P&gt;&lt;P&gt;If I only make non-breaking changes to the schema (just adding columns), am I to understand that I can simply remove the .option("mergeSchema", "true")?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 15:27:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64676#M32625</guid>
      <dc:creator>cosminsanda</dc:creator>
      <dc:date>2024-03-26T15:27:12Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64677#M32626</link>
      <description>&lt;UL&gt;&lt;LI&gt;To prevent data duplication, you can perform data deduplication based on a unique identifier (e.g., a primary key). -&amp;gt; I don't want to do data deduplication every time I add a column.&lt;/LI&gt;&lt;LI&gt;If your data has a natural key or timestamp, consider using it to identify unique records during the merge process. -&amp;gt; How?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 26 Mar 2024 15:40:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64677#M32626</guid>
      <dc:creator>cosminsanda</dc:creator>
      <dc:date>2024-03-26T15:40:58Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64681#M32628</link>
      <description>&lt;P&gt;OK, I might be totally wrong here, but it seems you are not using Auto Loader to move data from A to B. Auto Loader is an easy way to process ingested files, but here you run a Spark Structured Streaming query on table A.&lt;BR /&gt;When you change the selectExpr, the streaming query is restarted and the whole table is sent to B.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 16:04:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64681#M32628</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-03-26T16:04:17Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64740#M32641</link>
      <description>&lt;P&gt;Indeed, &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;, you might be right, since I don't really use the cloudFiles functionality, which is really what Auto Loader is about.&lt;/P&gt;&lt;P&gt;In any case, even if it's just regular Structured Streaming, given that I have a checkpoint configured that records which data has already been processed, I don't see how reprocessing the whole table is a reasonable thing to do.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Mar 2024 07:20:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64740#M32641</guid>
      <dc:creator>cosminsanda</dc:creator>
      <dc:date>2024-03-27T07:20:18Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64745#M32642</link>
      <description>&lt;P&gt;The purpose of checkpoints is to recover after a failure. In your case, the streaming query itself is changed. Structured Streaming isn't stateless, which means that in general the checkpoints cannot be reused when the query is changed.&lt;BR /&gt;Depending on the change, the query may still be able to recover from the existing checkpoint; see the recovery semantics after changes:&lt;BR /&gt;&lt;A href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query" target="_blank"&gt;https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query&lt;/A&gt;.&lt;BR /&gt;In your case the schema changes, and the guide notes:&lt;BR /&gt;&lt;EM&gt;Structured Streaming automatically checkpoints the state data to fault-tolerant storage (for example, HDFS, AWS S3, Azure Blob storage) and restores it after restart. However, this assumes that the schema of the state data remains same across restarts.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Mar 2024 07:31:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/64745#M32642</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-03-27T07:31:45Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/65273#M32763</link>
      <description>&lt;P&gt;What if I alter my target table B in advance, so that it contains the new columns before the query starts writing to them?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Apr 2024 07:30:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/65273#M32763</guid>
      <dc:creator>cosminsanda</dc:creator>
      <dc:date>2024-04-02T07:30:29Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/65274#M32764</link>
      <description>&lt;P&gt;I don't think that would work, as you would still have to change the query to select the new columns (unless you use a select *).&lt;BR /&gt;Here is an overview of what can and cannot be changed:&lt;BR /&gt;&lt;A href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query" target="_self"&gt;https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Your case falls under &lt;EM&gt;Changes in projection / filter / map-like operations&lt;/EM&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Apr 2024 07:45:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/65274#M32764</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-04-02T07:45:18Z</dc:date>
    </item>
    <item>
      <title>Re: Adding a new column triggers reprocessing of Auto Loader source table</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/65275#M32765</link>
      <description>&lt;P&gt;Change data feed might be a solution for you:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/delta/delta-change-data-feed.html" target="_self"&gt;https://docs.databricks.com/en/delta/delta-change-data-feed.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Apr 2024 07:54:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-a-new-column-triggers-reprocessing-of-auto-loader-source/m-p/65275#M32765</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-04-02T07:54:35Z</dc:date>
    </item>
  </channel>
</rss>

