<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic DLT apply_changes() SCD2 is not applying defined schema only for first run in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109797#M43401</link>
    <description>&lt;P&gt;Hello community,&lt;/P&gt;&lt;P&gt;I am using the dlt.apply_changes function to implement SCD2, and I specify the schema of the streaming table that apply_changes() should produce.&lt;/P&gt;&lt;P&gt;This schema contains a generated column.&lt;/P&gt;&lt;P&gt;Somehow, &lt;U&gt;on the first run&lt;/U&gt; my DLT pipeline always returns the streaming table with the generated column set to null.&lt;/P&gt;&lt;P&gt;Whenever I fully refresh the pipeline, the generated column is computed correctly.&lt;/P&gt;&lt;P&gt;Is there an explanation for why this problem arises only on the first run?&lt;BR /&gt;How can I avoid it, given that I want to destroy this pipeline (using a Databricks asset bundle) and launch it from scratch during the testing phase of my CI pipeline?&lt;/P&gt;&lt;P&gt;Below is my code:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;dlt.create_streaming_table(
    name="silver_table",
    schema="""row_id STRING NOT NULL,
        col_a STRING NOT NULL,
        `__START_AT` TIMESTAMP NOT NULL,
        `__END_AT` TIMESTAMP,
        last_updated TIMESTAMP,
        is_current BOOLEAN NOT NULL GENERATED ALWAYS AS (CASE WHEN `__END_AT` IS NULL THEN true ELSE false END)
    """,
    cluster_by=["col_a"],
    comment="scd2 table in silver layer",
)

dlt.apply_changes(
    source="data_input_cdc",
    target="silver_table",
    keys=["row_id"],
    sequence_by=F.col("synced"),
    except_column_list=["synced", "record_deleted"],
    stored_as_scd_type=2,
    apply_as_deletes=F.expr("record_deleted = true"),
)&lt;/PRE&gt;&lt;/DIV&gt;</description>
    <pubDate>Tue, 11 Feb 2025 13:59:15 GMT</pubDate>
    <dc:creator>HoussemBL</dc:creator>
    <dc:date>2025-02-11T13:59:15Z</dc:date>
    <item>
      <title>DLT apply_changes() SCD2 is not applying defined schema only for first run</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109797#M43401</link>
      <description>&lt;P&gt;Hello community,&lt;/P&gt;&lt;P&gt;I am using the dlt.apply_changes function to implement SCD2, and I specify the schema of the streaming table that apply_changes() should produce.&lt;/P&gt;&lt;P&gt;This schema contains a generated column.&lt;/P&gt;&lt;P&gt;Somehow, &lt;U&gt;on the first run&lt;/U&gt; my DLT pipeline always returns the streaming table with the generated column set to null.&lt;/P&gt;&lt;P&gt;Whenever I fully refresh the pipeline, the generated column is computed correctly.&lt;/P&gt;&lt;P&gt;Is there an explanation for why this problem arises only on the first run?&lt;BR /&gt;How can I avoid it, given that I want to destroy this pipeline (using a Databricks asset bundle) and launch it from scratch during the testing phase of my CI pipeline?&lt;/P&gt;&lt;P&gt;Below is my code:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;dlt.create_streaming_table(
    name="silver_table",
    schema="""row_id STRING NOT NULL,
        col_a STRING NOT NULL,
        `__START_AT` TIMESTAMP NOT NULL,
        `__END_AT` TIMESTAMP,
        last_updated TIMESTAMP,
        is_current BOOLEAN NOT NULL GENERATED ALWAYS AS (CASE WHEN `__END_AT` IS NULL THEN true ELSE false END)
    """,
    cluster_by=["col_a"],
    comment="scd2 table in silver layer",
)

dlt.apply_changes(
    source="data_input_cdc",
    target="silver_table",
    keys=["row_id"],
    sequence_by=F.col("synced"),
    except_column_list=["synced", "record_deleted"],
    stored_as_scd_type=2,
    apply_as_deletes=F.expr("record_deleted = true"),
)&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 11 Feb 2025 13:59:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109797#M43401</guid>
      <dc:creator>HoussemBL</dc:creator>
      <dc:date>2025-02-11T13:59:15Z</dc:date>
    </item>
    <item>
      <title>Re: DLT apply_changes() SCD2 is not applying defined schema only for first run</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109805#M43403</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115968"&gt;@HoussemBL&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Here are a few points to consider:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Initialization of generated columns&lt;/STRONG&gt;: Generated columns such as &lt;CODE&gt;is_current&lt;/CODE&gt; depend on other columns (&lt;CODE&gt;__END_AT&lt;/CODE&gt; in this case) being correctly populated first. If the sequencing or initialization of those columns is not handled correctly during the first run, the generated column can end up null.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Sequencing and ordering&lt;/STRONG&gt;: The &lt;CODE&gt;apply_changes&lt;/CODE&gt; function uses the &lt;CODE&gt;sequence_by&lt;/CODE&gt; column to determine the order of changes. If the sequencing is not correctly established during the first run, it can lead to issues with the generated columns. Ensure that the &lt;CODE&gt;sequence_by&lt;/CODE&gt; column (&lt;CODE&gt;synced&lt;/CODE&gt; in your case) is correctly populated and ordered.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
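&lt;P&gt;As a possible workaround (a sketch only, not verified against your pipeline; the view name is an assumption): instead of relying on a generated column, you could compute &lt;CODE&gt;is_current&lt;/CODE&gt; in a downstream view, so it is derived at read time and no longer depends on how the streaming table is initialized on the first run:&lt;/P&gt;
&lt;DIV&gt;&lt;PRE&gt;import dlt
from pyspark.sql import functions as F

# Hypothetical downstream view; "silver_table_current" is an assumed name.
# is_current is computed directly from __END_AT when the view is read,
# so it cannot be left null by the initial apply_changes() run.
@dlt.view(name="silver_table_current")
def silver_table_current():
    return (
        dlt.read("silver_table")
        .withColumn("is_current", F.col("__END_AT").isNull())
    )&lt;/PRE&gt;&lt;/DIV&gt;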
&lt;P&gt;When you fully refresh the pipeline, it reprocesses the data, which can correct these sequencing and initialization issues and lead to the correct computation of the generated column.&lt;/P&gt;</description>
      <pubDate>Tue, 11 Feb 2025 14:20:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109805#M43403</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-11T14:20:50Z</dc:date>
    </item>
    <item>
      <title>Re: DLT apply_changes() SCD2 is not applying defined schema only for first run</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109933#M43435</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Thanks for your reply.&lt;BR /&gt;I checked the impact of sequencing and ordering by running my DLT pipeline with an input dataset of a single row.&lt;BR /&gt;Still, I get the same behavior: the first run of the DLT pipeline fails, and the second attempt with a full refresh succeeds.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Feb 2025 06:29:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-apply-changes-scd2-is-not-applying-defined-schema-only-for/m-p/109933#M43435</guid>
      <dc:creator>HoussemBL</dc:creator>
      <dc:date>2025-02-12T06:29:36Z</dc:date>
    </item>
  </channel>
</rss>

