<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Delta table with unique columns incremental refresh in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78513#M35546</link>
    <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We have one huge streaming table from which we want to create another streaming table that picks only a few columns from the original. In this new table the rows must be unique.&lt;/P&gt;&lt;P&gt;Could someone please help with the implementation of this scenario? If it is not possible, why not, and are there any alternatives?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
    <pubDate>Fri, 12 Jul 2024 11:47:42 GMT</pubDate>
    <dc:creator>MyTrh</dc:creator>
    <dc:date>2024-07-12T11:47:42Z</dc:date>
    <item>
      <title>Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78513#M35546</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We have one huge streaming table from which we want to create another streaming table that picks only a few columns from the original. In this new table the rows must be unique.&lt;/P&gt;&lt;P&gt;Could someone please help with the implementation of this scenario? If it is not possible, why not, and are there any alternatives?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jul 2024 11:47:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78513#M35546</guid>
      <dc:creator>MyTrh</dc:creator>
      <dc:date>2024-07-12T11:47:42Z</dc:date>
    </item>
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78682#M35583</link>
      <description>&lt;P&gt;You can do this using Delta Live Tables. For the design you can follow the medallion architecture (&lt;A href="https://www.databricks.com/glossary/medallion-architecture" target="_blank"&gt;https://www.databricks.com/glossary/medallion-architecture&lt;/A&gt;).&lt;BR /&gt;You can make your first table bronze, with everything simply appended, and your second table silver, with defined keys and selected columns.&lt;BR /&gt;You can take a look here for how to implement this -&amp;nbsp;&lt;A href="https://docs.databricks.com/en/delta-live-tables/tutorial-pipelines.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/tutorial-pipelines.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 14 Jul 2024 08:08:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78682#M35583</guid>
      <dc:creator>p4pratikjain</dc:creator>
      <dc:date>2024-07-14T08:08:46Z</dc:date>
    </item>
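The bronze-to-silver split described in the reply above can be sketched in plain Python (illustrative only, not the DLT API; the rule of keeping the latest row per key is an assumption, since the post does not specify a dedup rule):

```python
# Sketch: bronze keeps every raw row appended as-is; silver projects the
# selected columns and keeps one row per key, which is the bronze -> silver
# dedup the answer describes.

bronze = [
    {"user_id": "u1", "department_id": "d1", "time_stamp": 1, "clicks": 20},
    {"user_id": "u1", "department_id": "d1", "time_stamp": 2, "clicks": 15},
    {"user_id": "u2", "department_id": "d2", "time_stamp": 3, "clicks": 5},
]

# Silver: select a few columns, unique on (user_id, department_id),
# keeping the latest row per key (one possible dedup rule).
silver = {}
for row in bronze:
    key = (row["user_id"], row["department_id"])
    silver[key] = {"user_id": row["user_id"],
                   "department_id": row["department_id"],
                   "clicks": row["clicks"]}

assert len(silver) == 2  # one row per unique key
```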
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78684#M35584</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/111926"&gt;@MyTrh&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think what you can try is to read from the first streaming table and then use foreachBatch when writing to the target table. Inside&amp;nbsp;foreachBatch you define your merge logic, and that should do the trick.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Edit: I didn't notice that you'd mentioned DLT. Ignore this post then, because I can't delete it &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 14 Jul 2024 08:52:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78684#M35584</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-14T08:52:33Z</dc:date>
    </item>
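A minimal plain-Python sketch of the per-batch merge semantics the reply describes (the function merge_batch and the dict-based target are illustrative stand-ins for a Delta MERGE inside a foreachBatch handler, not Spark API; the key and click values follow the example used later in the thread):

```python
# Sketch of the upsert a foreachBatch handler would apply with a Delta MERGE:
# each micro-batch is merged into the target keyed on (user_id, department_id);
# matched keys accumulate clicks, unmatched keys are inserted.

def merge_batch(target, batch):
    """target: dict (user_id, department_id) -> clicks; batch: list of rows."""
    for user_id, department_id, clicks in batch:
        key = (user_id, department_id)
        if key in target:            # whenMatched: update
            target[key] += clicks
        else:                        # whenNotMatched: insert
            target[key] = clicks
    return target

target = {}
merge_batch(target, [("user_001", "dept_01", 20)])
merge_batch(target, [("user_001", "dept_01", 15), ("user_002", "dept_02", 7)])
# target[("user_001", "dept_01")] is now 35
```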
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78762#M35597</link>
      <description>&lt;P&gt;Thanks for the responses.&lt;/P&gt;&lt;P&gt;It seems that I was not able to explain the problem statement properly, so here it is:&lt;/P&gt;&lt;P&gt;I have one streaming table in my DLT pipeline, say table1, with schema (user_id, department_id, time_stamp, clicks).&lt;/P&gt;&lt;P&gt;From this streaming table I want to make another table (streaming table or materialized view), table2, with schema (user_id, department_id, clicks). In this new table, (user_id, department_id) must be unique and clicks should be updated every time new data arrives in table1.&lt;/P&gt;&lt;P&gt;So far I am able to make table2 as a materialized view, but it does a complete refresh (flow type: complete) each time to calculate table2. What I want is an incremental update (flow type: incremental) of table2, so that it doesn't need to read the whole of table1 and then rebuild table2 from that.&lt;/P&gt;&lt;P&gt;Again, I have seen answers on incremental refresh, but none of them is in a DLT pipeline notebook, so I am not sure whether this is even supported in a DLT pipeline. It would be great if someone could provide sample code along with an explanation.&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jul 2024 08:28:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78762#M35597</guid>
      <dc:creator>MyTrh</dc:creator>
      <dc:date>2024-07-15T08:28:51Z</dc:date>
    </item>
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78819#M35606</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/111926"&gt;@MyTrh&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Thanks for the clarification.&amp;nbsp;&lt;SPAN&gt;I'll try to recreate your use case tomorrow and test it. But your approach seems correct. Like you, I would create a materialized view that consumes changes from the streaming table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Could you check whether the query you use to define the materialized view uses only the keywords below? They are required for incremental refresh, so if you are using a keyword not listed there, that could be why you get a complete refresh.&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/optimizations/incremental-refresh#support-for-materialized-view-incremental-refresh" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/optimizations/incremental-refresh#support-for-materialized-view-incremental-refresh&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jul 2024 15:00:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/78819#M35606</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-15T15:00:58Z</dc:date>
    </item>
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/79018#M35660</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;for looking into my query.&lt;/P&gt;&lt;P&gt;Yes, I am using only the allowed keywords to define the materialized view.&lt;/P&gt;&lt;P&gt;It would be really helpful if you could provide the notebook solution for this use case.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jul 2024 16:30:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/79018#M35660</guid>
      <dc:creator>MyTrh</dc:creator>
      <dc:date>2024-07-16T16:30:40Z</dc:date>
    </item>
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/79069#M35667</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/111926"&gt;@MyTrh&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;OK, I think I created a use case similar to yours. I have a streaming table with a column structure based on your example.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/9638i49B1808D04EF37C1/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;CREATE OR REFRESH STREAMING TABLE clicks_raw AS SELECT *, current_timestamp() AS load_time FROM cloud_files('/Volumes/dev/demo_db/landing_zone/clicks_data', "csv", map("cloudFiles.inferColumnTypes", "true"))&lt;/P&gt;&lt;P&gt;Now I have another streaming table, which consumes the first one. Here I group by the distinct combination of user_id and department_id and aggregate the click count for each group.&lt;/P&gt;&lt;P&gt;As you can see from the screen below, take for example user_id = user_001 and department_id = dept_01.&lt;/P&gt;&lt;P&gt;We can see that in clicks_raw there are two rows for this combination:&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; - one with 20 clicks&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; - another one with 15 clicks&lt;/P&gt;&lt;P&gt;So the final result in our target table for this user and this department should be 35.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image (1).png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/9639i8AA485158A7F70AD/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="image (1).png" alt="image (1).png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;CREATE OR REFRESH STREAMING TABLE clicks_aggregated AS SELECT user_id, department_id, SUM(clicks) AS clicks FROM STREAM(LIVE.clicks_raw) GROUP BY user_id, department_id&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;As you can see from the screen below, I achieved an &lt;SPAN&gt;incremental refresh&lt;/SPAN&gt;.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image (2).png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/9640i3871DD5F518CFB3F/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="image (2).png" alt="image (2).png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jul 2024 08:26:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/79069#M35667</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-17T08:26:21Z</dc:date>
    </item>
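The difference between the complete and incremental flow types discussed above can be illustrated in plain Python (a sketch of the semantics, not the Spark engine): folding micro-batches into running per-key state yields the same totals as recomputing from scratch, without rereading earlier data.

```python
# Sketch: a streaming GROUP BY keeps running state per key, so each
# micro-batch only touches the keys it contains (incremental), while a
# complete refresh re-aggregates all source rows from scratch.

def update_state(state, batch):
    for user_id, department_id, clicks in batch:
        key = (user_id, department_id)
        state[key] = state.get(key, 0) + clicks
    return state

batch1 = [("user_001", "dept_01", 20)]
batch2 = [("user_001", "dept_01", 15), ("user_002", "dept_02", 5)]

# Incremental flow: fold one batch at a time into persistent state.
incremental = update_state(update_state({}, batch1), batch2)

# Complete flow: aggregate everything from scratch each refresh.
complete = update_state({}, batch1 + batch2)

assert incremental == complete          # same totals either way
assert incremental[("user_001", "dept_01")] == 35
```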
    <item>
      <title>Re: Delta table with unique columns incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/79328#M35736</link>
      <description>&lt;P&gt;Thank you very much,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;- this worked for me.&lt;/P&gt;&lt;P&gt;But I wonder then: what is the difference between a streaming table and a materialized view, given that we can incrementally "UPDATE" the streaming table as well?&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jul 2024 05:59:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-table-with-unique-columns-incremental-refresh/m-p/79328#M35736</guid>
      <dc:creator>MyTrh</dc:creator>
      <dc:date>2024-07-19T05:59:43Z</dc:date>
    </item>
  </channel>
</rss>

