<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Schema evolution issue in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33852#M24766</link>
    <description>&lt;P&gt;It's not on the writer that you need to evolve the schema; it's on the read side that you're running into the problem. The docs &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-evolution" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-evolution" target="_blank"&gt;here&lt;/A&gt; describe how to adjust Auto Loader.&lt;/P&gt;</description>
    <pubDate>Fri, 03 Dec 2021 12:46:25 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2021-12-03T12:46:25Z</dc:date>
    <item>
      <title>Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33851#M24765</link>
      <description>&lt;P&gt;Hi All&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am loading some data using Auto Loader but am having trouble with schema evolution.&lt;/P&gt;&lt;P&gt;A new column has been added to the data I am loading and I am getting the following error:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;StreamingQueryException: Encountered unknown field(s) during parsing: {"SomeField":{}}&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm not 100% sure whether this error is being thrown by Auto Loader or by Structured Streaming, but I am not specifying a schema in the cloudFiles config (just a schema location), and I am setting the following option on the writeStream:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;.option("mergeSchema", "true")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Does anyone have any thoughts on this?&lt;/P&gt;&lt;P&gt;Cheers&lt;/P&gt;&lt;P&gt;Mat&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 11:18:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33851#M24765</guid>
      <dc:creator>Confused</dc:creator>
      <dc:date>2021-12-03T11:18:17Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33852#M24766</link>
      <description>&lt;P&gt;It's not on the writer that you need to evolve the schema; it's on the read side that you're running into the problem. The docs &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-evolution" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-evolution" target="_blank"&gt;here&lt;/A&gt; describe how to adjust Auto Loader.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 12:46:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33852#M24766</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-12-03T12:46:25Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33853#M24767</link>
      <description>&lt;P&gt;Hi Josephk&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I had read that doc, but I don't see where I am going wrong.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Per the first example, I should be doing this:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;spark.readStream.format("cloudFiles") \&lt;/P&gt;&lt;P&gt;  .option("cloudFiles.format", "json") \&lt;/P&gt;&lt;P&gt;  .option("cloudFiles.schemaLocation", "&amp;lt;path_to_schema_location&amp;gt;") \&lt;/P&gt;&lt;P&gt;  .load("&amp;lt;path_to_source_data&amp;gt;") \&lt;/P&gt;&lt;P&gt;  .writeStream \&lt;/P&gt;&lt;P&gt;  .option("mergeSchema", "true") \&lt;/P&gt;&lt;P&gt;  .option("checkpointLocation", "&amp;lt;path_to_checkpoint&amp;gt;") \&lt;/P&gt;&lt;P&gt;  .start("&amp;lt;path_to_target&amp;gt;")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a few more cloudFiles options because I'm reading file notifications from a queue, but otherwise I am doing the same as above: not specifying a schema on the read, and setting mergeSchema on the write.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 12:54:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33853#M24767</guid>
      <dc:creator>Confused</dc:creator>
      <dc:date>2021-12-03T12:54:26Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33854#M24768</link>
      <description>&lt;P&gt;You'll need to set the option on the reader to add new columns. It's:&lt;/P&gt;&lt;P&gt;.option("cloudFiles.schemaEvolutionMode", "addNewColumns")&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 13:02:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33854#M24768</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-12-03T13:02:57Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33855#M24769</link>
      <description>&lt;P&gt;Hmm, I hadn't added it because that doc says it is the default when you don't provide a schema:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;addNewColumns: The default mode when a schema is not provided to Auto Loader.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I will give it a try though, thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 13:22:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33855#M24769</guid>
      <dc:creator>Confused</dc:creator>
      <dc:date>2021-12-03T13:22:55Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33856#M24770</link>
      <description>&lt;P&gt;Yeah, I get the same error. I ran the job twice, since per the docs the first run should fail and the second should succeed, but I got the identical error both times.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 14:06:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33856#M24770</guid>
      <dc:creator>Confused</dc:creator>
      <dc:date>2021-12-03T14:06:31Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33857#M24771</link>
      <description>&lt;P&gt;Hi all, this is due to an empty struct column, which Auto Loader is confusing with a struct that has some schema.&lt;/P&gt;&lt;P&gt;If you know the struct's schema from past data, give Auto Loader a schema hint for the struct, or read this column as a string and parse it later using from_json or regexp_extract.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-hints" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-hints" target="_blank"&gt;https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#schema-hints&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Dec 2021 14:03:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33857#M24771</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2021-12-10T14:03:08Z</dc:date>
    </item>
    <item>
      <title>Re: Schema evolution issue</title>
      <link>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33858#M24772</link>
      <description>&lt;P&gt;I agree that hints are the way to go if you have the schema available, but the whole point of schema evolution is that you might not always know the schema in advance.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I received a similar error with a similar streaming query configuration. The issue was that the read schema is derived from a limited sample of the files to be imported (configurable, but 1000 files by default). The new field wasn't in the sample, so the query errored out when it ran into the new field later in the ingest process.&lt;/P&gt;</description>
      <pubDate>Fri, 15 Jul 2022 14:16:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/schema-evolution-issue/m-p/33858#M24772</guid>
      <dc:creator>rgrosskopf</dc:creator>
      <dc:date>2022-07-15T14:16:06Z</dc:date>
    </item>
  </channel>
</rss>

