<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Handling data close to SCD2 with Delta tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/handling-data-close-to-scd2-with-delta-tables/m-p/55763#M30427</link>
    <description>&lt;P&gt;Your problem is exactly like SCD2 . You just add one more column with valid to date ( optionals you can add flag is actual to tag current records)&lt;/P&gt;&lt;P&gt;You can use DLT apply changes syntax. Alternatively Merge statement .&lt;/P&gt;&lt;P&gt;On the top of that table you can build views or function to queryy table at given timestamp.&lt;/P&gt;&lt;P&gt;Current state view will be with WHERE valit_to is null.&lt;/P&gt;&lt;P&gt;To query history at given state you just do 'your date' is between valid_from and valod_to.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 26 Dec 2023 22:19:43 GMT</pubDate>
    <dc:creator>Wojciech_BUK</dc:creator>
    <dc:date>2023-12-26T22:19:43Z</dc:date>
    <item>
      <title>Handling data close to SCD2 with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-data-close-to-scd2-with-delta-tables/m-p/55752#M30423</link>
      <description>&lt;P&gt;Hello,&amp;nbsp;&lt;/P&gt;&lt;P&gt;stack used: pyspark and delta tables&lt;/P&gt;&lt;P&gt;I'm working with some data that look a bit like SCD2 data.&lt;/P&gt;&lt;P&gt;Basically, the data has columns that represent an id, a rank column and other informations, here's an example:&lt;/P&gt;&lt;P&gt;login, email, business_timestamp =&amp;gt; these fields are the "primary" key&lt;/P&gt;&lt;P&gt;received_date =&amp;gt; field to sort on&lt;/P&gt;&lt;P&gt;account_points =&amp;gt; other&lt;/P&gt;&lt;P&gt;For received_date 1:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;login&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;email&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;business_timestamp&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;received_date&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;account_points&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;10&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;bbb&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;bbb@mail.com&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01T10:00&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;7&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The data may or may not change in the future, and in the case it does change it'll be duplicated on the primary key fields (login, email, business_timestamp) and to find the one we want to keep we'll sort on the received_date (the latest would be the one to keep).&lt;/P&gt;&lt;P&gt;For received_date 2:&amp;nbsp;&lt;/P&gt;&lt;TABLE border="1" width="97.99235324593178%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;login&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;email&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;business_timestamp&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;received_date&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;account_points&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;2024-01-02&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;15&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Expected result:&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;historic view (from which we can rebuild any "best view" for a date with an sql query)&lt;/P&gt;&lt;P&gt;is the result of an naive append operation on the delta table everytime the ingestion is ran&lt;/P&gt;&lt;TABLE border="1" width="780px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;login&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;email&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;business_timestamp&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;received_date&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;account_points&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;10&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;bbb&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;bbb@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-02&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;15&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T10:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;7&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;best view (latest view for all business use)&lt;/P&gt;&lt;P&gt;is the result of a window rank function on the keys and sorting on the received date, it uses the historic view to create it.&lt;/P&gt;&lt;TABLE border="1" width="780px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;login&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;email&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;business_timestamp&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;received_date&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;account_points&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-02&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;15&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;bbb&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;bbb@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T00:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="156px" height="30px"&gt;aaa&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;aaa@mail.com&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01T10:00&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;2024-01-01&lt;/TD&gt;&lt;TD width="156px" height="30px"&gt;7&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My question is: how would you design such a system to answer two problematics:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;have a best view that is the singular data at the latest received_date&lt;/LI&gt;&lt;LI&gt;keep a historic of the evolutions to rebuild any "best view" at any given received date &amp;lt;&amp;gt; filter&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;I found that time travel won't really work because we can't customize the fields on which to travel (id and timestamp aren't practical).&amp;nbsp;&lt;/P&gt;&lt;P&gt;I also didn't find a real "upsert" operation using pyspark.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Dec 2023 14:45:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-data-close-to-scd2-with-delta-tables/m-p/55752#M30423</guid>
      <dc:creator>RabahO</dc:creator>
      <dc:date>2023-12-26T14:45:12Z</dc:date>
    </item>
    <item>
      <title>Re: Handling data close to SCD2 with Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-data-close-to-scd2-with-delta-tables/m-p/55763#M30427</link>
      <description>&lt;P&gt;Your problem is exactly like SCD2 . You just add one more column with valid to date ( optionals you can add flag is actual to tag current records)&lt;/P&gt;&lt;P&gt;You can use DLT apply changes syntax. Alternatively Merge statement .&lt;/P&gt;&lt;P&gt;On the top of that table you can build views or function to queryy table at given timestamp.&lt;/P&gt;&lt;P&gt;Current state view will be with WHERE valit_to is null.&lt;/P&gt;&lt;P&gt;To query history at given state you just do 'your date' is between valid_from and valod_to.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Dec 2023 22:19:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-data-close-to-scd2-with-delta-tables/m-p/55763#M30427</guid>
      <dc:creator>Wojciech_BUK</dc:creator>
      <dc:date>2023-12-26T22:19:43Z</dc:date>
    </item>
  </channel>
</rss>

