<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta Lake, CFD &amp; SCD2 in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55631#M30385</link>
    <description>&lt;P&gt;I think you can explore the DLT API "Apply Changes".&lt;BR /&gt;It only runs inside a DLT pipeline, but it can read from a streaming endpoint or a streaming table.&lt;BR /&gt;Please check the docs:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python&lt;/A&gt;&lt;BR /&gt;You just include this line of code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;  stored_as_scd_type = "2"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;and you have your SCD2 logic done &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;But if you want to do classic engineering, you are right: this is a MERGE with&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;whenMatchedUpdate&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can do it in either PySpark or Spark SQL, and you can also do it with Delta table streaming.&lt;BR /&gt;&lt;BR /&gt;Please let me know if that helped, or whether I misunderstood your question.&lt;/P&gt;</description>
    <pubDate>Thu, 21 Dec 2023 22:51:48 GMT</pubDate>
    <dc:creator>Wojciech_BUK</dc:creator>
    <dc:date>2023-12-21T22:51:48Z</dc:date>
    <item>
      <title>Delta Lake, CFD &amp; SCD2</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55598#M30380</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;What's the best way to deal with SCD2-style tables in the silver and/or gold layer while streaming?&lt;/P&gt;&lt;P&gt;From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes).&lt;/P&gt;&lt;P&gt;In an SCD2 scenario, we need to insert a new record (postimage) and "end-date" the old record in the target. Hence, two operations are required.&lt;/P&gt;&lt;P&gt;As of now, I can't see how to implement that in a streaming microbatch (foreachBatch) or CDC-CDF stream. In a "classic" DWH (including Data Vault) this is an extra step. I guess this is not applicable in a streaming/near-realtime scenario, since we would have two active records until the old one was marked invalid.&lt;/P&gt;&lt;P&gt;So in other words, I wonder how to:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;insert the new record, using the current/processing timestamp as start_ts (easy &amp;amp; taught in the videos)&lt;/LI&gt;&lt;LI&gt;update the old record's end_ts to the new record's start_ts minus one of the smallest unit (e.g. second or millisecond). In "classic" SQL this could be another MERGE-when-matched using LEAD&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Any thoughts?&lt;/P&gt;&lt;P&gt;Since I am "playing around" &amp;amp; learning, you may assume I follow the recommended Medallion architecture. So bronze would be a multiplexed table with Kafka/Debezium/JSON records, and the second stream from bronze to silver uses deduplication with watermarks and the PII stuff presented in the training &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
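      <!--
The second step in the question above (end-dating each old record from the next record's start_ts) can be illustrated outside Spark. The following is only a plain-Python sketch of the LEAD idea, not Delta Lake or Structured Streaming code. The function name, the row layout (key, value, start_ts, end_ts) and the one-millisecond unit are assumptions made for this illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption for this sketch: one millisecond is the "smallest unit".
SMALLEST_UNIT = timedelta(milliseconds=1)


def end_date_versions(rows):
    """Give every version an end_ts of the next version's start_ts minus one unit.

    rows: list of dicts with 'key' and 'start_ts'. The latest version of each
    key gets end_ts=None, meaning it is still current.
    """
    ordered = sorted(rows, key=itemgetter("key", "start_ts"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("key")):
        versions = list(group)
        # Pair each version with its successor, which is what
        # LEAD(start_ts) OVER (PARTITION BY key ORDER BY start_ts) would return.
        successors = versions[1:] + [None]
        for current, nxt in zip(versions, successors):
            end_ts = nxt["start_ts"] - SMALLEST_UNIT if nxt else None
            result.append({**current, "end_ts": end_ts})
    return result


history = end_date_versions([
    {"key": 1, "value": "a", "start_ts": datetime(2023, 12, 21, 13, 0, 0)},
    {"key": 1, "value": "b", "start_ts": datetime(2023, 12, 21, 14, 0, 0)},
    {"key": 2, "value": "x", "start_ts": datetime(2023, 12, 21, 13, 30, 0)},
])
```

In this sketch only the older version of key 1 receives an end_ts; the newest version of each key stays open. In a real pipeline the same pairing would have to run against the target table inside the microbatch, which this sketch does not attempt.
      -->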
      <pubDate>Thu, 21 Dec 2023 13:07:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55598#M30380</guid>
      <dc:creator>quakenbush</dc:creator>
      <dc:date>2023-12-21T13:07:16Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Lake, CFD &amp; SCD2</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55631#M30385</link>
      <description>&lt;P&gt;I think you can explore the DLT API "Apply Changes".&lt;BR /&gt;It only runs inside a DLT pipeline, but it can read from a streaming endpoint or a streaming table.&lt;BR /&gt;Please check the docs:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/delta-live-tables/cdc.html#language-python&lt;/A&gt;&lt;BR /&gt;You just include this line of code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;  stored_as_scd_type = "2"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;and you have your SCD2 logic done &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;But if you want to do classic engineering, you are right: this is a MERGE with&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;whenMatchedUpdate&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can do it in either PySpark or Spark SQL, and you can also do it with Delta table streaming.&lt;BR /&gt;&lt;BR /&gt;Please let me know if that helped, or whether I misunderstood your question.&lt;/P&gt;</description>
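      <!--
For the classic MERGE route mentioned in this reply, one commonly described way to get both SCD2 operations into a single pass is to stage each changed row twice: once with a merge key equal to the business key (so it matches and closes the old version) and once with an empty merge key (so it never matches and is inserted). The following plain-Python sketch only simulates that idea on lists of dicts; it does not call the Delta Lake API. The function names and the row layout are hypothetical, and end_ts is simply set to the new start_ts here for brevity.

```python
from datetime import datetime


def stage_updates(target, updates):
    """Stage each changed row twice: one copy to close the old version, one to insert."""
    current = {r["key"]: r for r in target if r["end_ts"] is None}
    staged = []
    for u in updates:
        old = current.get(u["key"])
        if old is not None and old["value"] == u["value"]:
            continue  # unchanged row, nothing to do
        if old is not None:
            # merge_key equals the business key, so this copy matches the open row
            staged.append({"merge_key": u["key"], **u})
        # merge_key None never matches, so this copy is always inserted
        staged.append({"merge_key": None, **u})
    return staged


def merge_scd2(target, staged):
    """Single pass over staged rows: a match end-dates the open row, a miss inserts."""
    result = [dict(r) for r in target]  # copy so the input is left untouched
    inserts = []
    for s in staged:
        matched = [r for r in result
                   if r["end_ts"] is None and r["key"] == s["merge_key"]]
        if matched:
            for r in matched:
                r["end_ts"] = s["start_ts"]
        else:
            inserts.append({"key": s["key"], "value": s["value"],
                            "start_ts": s["start_ts"], "end_ts": None})
    return result + inserts


target = [{"key": 1, "value": "a", "start_ts": datetime(2023, 1, 1), "end_ts": None}]
updates = [
    {"key": 1, "value": "b", "start_ts": datetime(2023, 6, 1)},
    {"key": 2, "value": "x", "start_ts": datetime(2023, 6, 1)},
]
merged = merge_scd2(target, stage_updates(target, updates))
```

Because closing and inserting happen in the same pass, there is no window in which two versions of key 1 are both open, which is the concern raised in the original question. In PySpark the matched branch would correspond to whenMatchedUpdate on the target table.
      -->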
      <pubDate>Thu, 21 Dec 2023 22:51:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55631#M30385</guid>
      <dc:creator>Wojciech_BUK</dc:creator>
      <dc:date>2023-12-21T22:51:48Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Lake, CFD &amp; SCD2</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55652#M30395</link>
      <description>&lt;P&gt;I did some further reading and came to the same conclusion. APPLY CHANGES might do the trick. However, I don't like the limitations. From Bronze to Silver I might need .foreachBatch to implement the JSON logic, and the attribute names (__start_at / __end_at) seem to be static, which can make migrations of BI layers (gold) hard.&lt;/P&gt;&lt;P&gt;Anyway, I think you answered my question by confirming it. Since I'm doing this for learning &amp;amp; fun, I'd like to develop my own "Databricks-native Datavault 2.0" loader with respect to PII handling as taught in the training &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 22 Dec 2023 08:10:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-lake-cfd-amp-scd2/m-p/55652#M30395</guid>
      <dc:creator>quakenbush</dc:creator>
      <dc:date>2023-12-22T08:10:05Z</dc:date>
    </item>
  </channel>
</rss>

