<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks Autoloader Schema Evolution throws StateSchemaNotCompatible exception in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/53649#M29857</link>
    <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for the answer.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Clearing checkpoint data is, unfortunately, not an option. The Stream would reprocess all the data, which is not what I want since the Stream runs incrementally.&lt;/LI&gt;&lt;LI&gt;Manual schema declaration is also not an option since I want to add new columns.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What confuses me is that the &lt;STRONG&gt;StateSchemaNotCompatible&lt;/STRONG&gt; exception is emitted from Spark Structured Streaming and is not an AutoLoader exception.&lt;/P&gt;&lt;P&gt;When I add a new column to the base table, the Stream fails with the NEW_FIELDS_IN_RECORD_WITH_FILE_PATH exception, which is expected when specifying addNewColumns.&lt;/P&gt;&lt;P&gt;When I restart the Stream, it fails with &lt;STRONG&gt;StateSchemaNotCompatible&lt;/STRONG&gt;, which shouldn't be the case since the schema should be updated as soon as AutoLoader fails with the NEW_FIELDS_IN_RECORD_WITH_FILE_PATH exception.&lt;/P&gt;&lt;P&gt;My use case seems to be straightforward. I cannot imagine that I am the only one who tries to run AutoLoader with:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Structured Streaming&lt;/LI&gt;&lt;LI&gt;JSON files as source&lt;/LI&gt;&lt;LI&gt;Column Type Inference&lt;/LI&gt;&lt;LI&gt;Automated Schema Evolution&lt;/LI&gt;&lt;LI&gt;Delta as the target&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Thu, 23 Nov 2023 15:01:19 GMT</pubDate>
    <dc:creator>robertkoss</dc:creator>
    <dc:date>2023-11-23T15:01:19Z</dc:date>
    <item>
      <title>Databricks Autoloader Schema Evolution throws StateSchemaNotCompatible exception</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/53609#M29843</link>
      <description>&lt;P&gt;I am trying to use Databricks Autoloader for a very simple use case:&lt;/P&gt;&lt;P&gt;Reading JSON files from S3 and loading them into a Delta table, with schema inference and evolution.&lt;/P&gt;&lt;P&gt;This is my code:&lt;/P&gt;&lt;PRE&gt;self.spark \
      .readStream \
      .format("cloudFiles") \
      .option("cloudFiles.format", "json") \
      .option("cloudFiles.inferColumnTypes", "true") \
      .option("cloudFiles.schemaLocation", f"{self.target_s3_bucket}/_schema/{source_table_name}") \
      .load(f"{self.source_s3_bucket}/{source_table_name}") \
      .distinct() \
      .writeStream \
      .trigger(availableNow=True) \
      .format("delta") \
      .option("mergeSchema", "true") \
      .option("checkpointLocation", f"{self.target_s3_bucket}/_checkpoint/{source_table_name}") \
      .option("streamName", source_table_name) \
      .start(f"{self.target_s3_bucket}/{target_table_name}")&lt;/PRE&gt;&lt;P&gt;When a JSON file with an unknown column arrives, the Stream fails, as expected, with a NEW_FIELDS_IN_RECORD_WITH_FILE_PATH exception.&lt;/P&gt;&lt;P&gt;But when I retry the job, I get the following exception:&lt;/P&gt;&lt;PRE&gt;StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.&lt;/PRE&gt;&lt;P&gt;This is my first time using Autoloader. Am I doing something obviously wrong?&lt;/P&gt;&lt;P&gt;I have already posted this on StackOverflow, but the answers I got were not that helpful:&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/77482302/databricks-autoloader-schema-evolution-throws-stateschemanotcompatible-exception" target="_blank"&gt;https://stackoverflow.com/questions/77482302/databricks-autoloader-schema-evolution-throws-stateschemanotcompatible-exception&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 23 Nov 2023 09:49:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/53609#M29843</guid>
      <dc:creator>robertkoss</dc:creator>
      <dc:date>2023-11-23T09:49:10Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Autoloader Schema Evolution throws StateSchemaNotCompatible exception</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/53649#M29857</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for the answer.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Clearing checkpoint data is, unfortunately, not an option. The Stream would reprocess all the data, which is not what I want since the Stream runs incrementally.&lt;/LI&gt;&lt;LI&gt;Manual schema declaration is also not an option since I want to add new columns.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What confuses me is that the &lt;STRONG&gt;StateSchemaNotCompatible&lt;/STRONG&gt; exception is emitted from Spark Structured Streaming and is not an AutoLoader exception.&lt;/P&gt;&lt;P&gt;When I add a new column to the base table, the Stream fails with the NEW_FIELDS_IN_RECORD_WITH_FILE_PATH exception, which is expected when specifying addNewColumns.&lt;/P&gt;&lt;P&gt;When I restart the Stream, it fails with &lt;STRONG&gt;StateSchemaNotCompatible&lt;/STRONG&gt;, which shouldn't be the case since the schema should be updated as soon as AutoLoader fails with the NEW_FIELDS_IN_RECORD_WITH_FILE_PATH exception.&lt;/P&gt;&lt;P&gt;My use case seems to be straightforward. I cannot imagine that I am the only one who tries to run AutoLoader with:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Structured Streaming&lt;/LI&gt;&lt;LI&gt;JSON files as source&lt;/LI&gt;&lt;LI&gt;Column Type Inference&lt;/LI&gt;&lt;LI&gt;Automated Schema Evolution&lt;/LI&gt;&lt;LI&gt;Delta as the target&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 23 Nov 2023 15:01:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/53649#M29857</guid>
      <dc:creator>robertkoss</dc:creator>
      <dc:date>2023-11-23T15:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Autoloader Schema Evolution throws StateSchemaNotCompatible exception</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/101931#M40898</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/94132"&gt;@robertkoss&lt;/a&gt;&amp;nbsp;I have the exact same problem. Have you found a solution?&lt;/P&gt;</description>
      <pubDate>Thu, 12 Dec 2024 14:39:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/101931#M40898</guid>
      <dc:creator>Nes_Hdr</dc:creator>
      <dc:date>2024-12-12T14:39:43Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Autoloader Schema Evolution throws StateSchemaNotCompatible exception</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/101954#M40910</link>
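      <description>&lt;P&gt;Hey, the problem is&lt;/P&gt;&lt;PRE&gt;distinct()&lt;/PRE&gt;&lt;P&gt;because it is a stateful operation: its state is saved in the checkpoint with the old schema, which no longer matches once a new column is added.&lt;/P&gt;</description>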
      <pubDate>Thu, 12 Dec 2024 15:37:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-autoloader-schema-evolution-throws/m-p/101954#M40910</guid>
      <dc:creator>robertkoss</dc:creator>
      <dc:date>2024-12-12T15:37:08Z</dc:date>
    </item>
  </channel>
</rss>

