<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Microbatching incremental updates Delta Live Tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/49931#M28656</link>
    <description>&lt;P&gt;I need to create a workflow that pulls recent data from a database every two minutes, then transforms that data in various ways, and appends the results to a final table. The problem is that some of these changes _might_ update existing rows in the final table and I need to resolve the differences, because only columns with new data should be updated. That is, data for a specific `event_time` can sometimes arrive late. For example, `did_foo_value_exceed_n` should be updated when a foo comes in for an older `event_time`.&lt;/P&gt;&lt;P&gt;Anyway, I attempted to do this in Delta Live Tables. However, you cannot pull from a future table to join and merge changes before applying CDC. I created a normal PySpark script that performs the merge with DeltaTable, but this cannot be used with a Delta Live Tables pipeline, because Workflows don't allow separate compute (Delta Live Tables compute vs Workflow compute) to access the same tables, so I can't take the result of the Delta Live Tables pipeline.&lt;/P&gt;&lt;P&gt;The biggest issue is that I can't use a triggered workflow because the time to acquire compute is longer than the time I need to run this pipeline. Is there any way I can keep compute between Workflow runs?&lt;/P&gt;</description>
    <pubDate>Thu, 26 Oct 2023 17:15:53 GMT</pubDate>
    <dc:creator>Erik_L</dc:creator>
    <dc:date>2023-10-26T17:15:53Z</dc:date>
    <item>
      <title>Microbatching incremental updates Delta Live Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/49931#M28656</link>
      <description>&lt;P&gt;I need to create a workflow that pulls recent data from a database every two minutes, then transforms that data in various ways, and appends the results to a final table. The problem is that some of these changes _might_ update existing rows in the final table and I need to resolve the differences, because only columns with new data should be updated. That is, data for a specific `event_time` can sometimes arrive late. For example, `did_foo_value_exceed_n` should be updated when a foo comes in for an older `event_time`.&lt;/P&gt;&lt;P&gt;Anyway, I attempted to do this in Delta Live Tables. However, you cannot pull from a future table to join and merge changes before applying CDC. I created a normal PySpark script that performs the merge with DeltaTable, but this cannot be used with a Delta Live Tables pipeline, because Workflows don't allow separate compute (Delta Live Tables compute vs Workflow compute) to access the same tables, so I can't take the result of the Delta Live Tables pipeline.&lt;/P&gt;&lt;P&gt;The biggest issue is that I can't use a triggered workflow because the time to acquire compute is longer than the time I need to run this pipeline. Is there any way I can keep compute between Workflow runs?&lt;/P&gt;</description>
      <pubDate>Thu, 26 Oct 2023 17:15:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/49931#M28656</guid>
      <dc:creator>Erik_L</dc:creator>
      <dc:date>2023-10-26T17:15:53Z</dc:date>
    </item>
    <item>
      <title>Re: Microbatching incremental updates Delta Live Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/50291#M28755</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/35139"&gt;@Erik_L&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Nov 2023 17:19:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/50291#M28755</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2023-11-01T17:19:37Z</dc:date>
    </item>
    <item>
      <title>Re: Microbatching incremental updates Delta Live Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/50807#M28895</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/35139"&gt;@Erik_L&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;As my colleague mentioned, to keep the Delta Live Tables pipeline compute running between Workflow runs, use a long-running Databricks Job instead of a triggered Databricks Workflow. A long-running job keeps a single Spark context alive, so the data transformations and merge tasks can run on each cycle without waiting for new compute.&lt;/SPAN&gt;&lt;/P&gt;
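&lt;P&gt;A minimal sketch of that idea, written in plain Python rather than PySpark so it runs anywhere. The names `merge_batch`, `run_forever`, `fetch_batch`, and `max_batches` are hypothetical, a dict keyed by `event_time` stands in for the final Delta table, and it assumes that a `None` value means "no new data for this column". In a real job the merge step would likely be a `DeltaTable` merge instead.&lt;/P&gt;
&lt;PRE&gt;
```python
import time


def merge_batch(table, batch, key="event_time"):
    """Upsert batch rows into table, overwriting only columns that carry new data."""
    for row in batch:
        # Insert an empty row for an unseen key, or fetch the existing row.
        existing = table.setdefault(row[key], {})
        for column, value in row.items():
            # Assumption: None means "no new data", so the stored value is kept.
            if value is not None:
                existing[column] = value
    return table


def run_forever(fetch_batch, table, interval_seconds=120.0, max_batches=None):
    """Process micro-batches in one long-lived process, so state is kept between cycles."""
    done = 0
    while max_batches is None or done != max_batches:
        merge_batch(table, fetch_batch())
        done += 1
        # Sleep only if another cycle will follow.
        if max_batches is None or done != max_batches:
            time.sleep(interval_seconds)
    return table


# Demo with two hypothetical micro-batches; the second carries late data for 10:00.
batches = iter([
    [{"event_time": "10:00", "foo": 3, "did_foo_value_exceed_n": False}],
    [{"event_time": "10:00", "foo": None, "did_foo_value_exceed_n": True}],
])
final = run_forever(lambda: next(batches), {}, interval_seconds=0, max_batches=2)
print(final)
```
&lt;/PRE&gt;
&lt;P&gt;In the demo, the late row updates `did_foo_value_exceed_n` for the older `event_time` and leaves `foo` untouched. `max_batches` exists only so the example terminates; a long-running job would leave it unset.&lt;/P&gt;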
&lt;P&gt;Please let me know if it resolves the issue.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Nov 2023 09:22:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/microbatching-incremental-updates-delta-live-tables/m-p/50807#M28895</guid>
      <dc:creator>Manisha_Jena</dc:creator>
      <dc:date>2023-11-10T09:22:10Z</dc:date>
    </item>
  </channel>
</rss>

