<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to use foreachbatch in deltalivetable or DLT? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14068#M8620</link>
    <description>&lt;P&gt;You were running out of memory using .dropDuplicates on a stream because you did not specify a streaming watermark. Without one, the deduplication state grows without bound. The watermark defines a threshold past which late data can be ignored, so the state for that time frame no longer needs to be kept.&lt;/P&gt;</description>
    <pubDate>Thu, 22 Jun 2023 02:49:50 GMT</pubDate>
    <dc:creator>JohnA</dc:creator>
    <dc:date>2023-06-22T02:49:50Z</dc:date>
    <item>
      <title>How to use foreachbatch in deltalivetable or DLT?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14064#M8616</link>
      <description>&lt;P&gt;I need to apply some transformations to incoming data in batches and want to know if there is a way to use the foreachBatch option in Delta Live Tables. I am using Auto Loader to load JSON files, and then I need to apply foreachBatch and store the results in another table.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2022 15:20:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14064#M8616</guid>
      <dc:creator>rdobbss</dc:creator>
      <dc:date>2022-07-11T15:20:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to use foreachbatch in deltalivetable or DLT?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14066#M8618</link>
      <description>&lt;P&gt;@Kaniz Fatma I am aware of this, but I am specifically looking for a way to use foreachBatch in a Delta Live Tables pipeline. I know how it can be achieved in regular notebooks but haven't found anything for Delta Live Tables.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jul 2022 14:32:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14066#M8618</guid>
      <dc:creator>rdobbss</dc:creator>
      <dc:date>2022-07-14T14:32:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to use foreachbatch in deltalivetable or DLT?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14067#M8619</link>
      <description>&lt;P&gt;Not sure if this will apply to you or not...&lt;/P&gt;&lt;P&gt;I was looking at foreachBatch as a way to reduce the workload of getting distinct data from a history table of 20 million+ records, because df.dropDuplicates() was intermittently running out of memory during DLT pipeline execution. I ended up doing this instead:&lt;/P&gt;&lt;P&gt;import dlt&lt;/P&gt;&lt;P&gt;from pyspark.sql.functions import col&lt;/P&gt;&lt;P&gt;# define the target table to push the latest values to&lt;/P&gt;&lt;P&gt;dlt.create_target_table("stg_service_requests_unpacked_new_distinct")&lt;/P&gt;&lt;P&gt;# use the apply_changes function to perform the merge&lt;/P&gt;&lt;P&gt;dlt.apply_changes(&lt;/P&gt;&lt;P&gt;&amp;nbsp;target = "stg_service_requests_unpacked_new_distinct",&lt;/P&gt;&lt;P&gt;&amp;nbsp;source = "stg_service_requests_unpacked_new",&lt;/P&gt;&lt;P&gt;&amp;nbsp;keys = dupe_cols_evaluation,&lt;/P&gt;&lt;P&gt;&amp;nbsp;sequence_by = col("_hdr_time_in_ms"),&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;dupe_cols_evaluation is a Python list in which I defined the columns to evaluate for de-duplication. The outputs appear to be correct, and incremental updates run very quickly with this process.&lt;/P&gt;</description>
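      <!--
For readers who want to see the de-duplication idea from the post above in isolation, here is a minimal plain-Python sketch. It is a toy model, not the DLT implementation: it assumes the merge keeps, for each distinct combination of key columns, the row with the highest value in the sequencing column. The function name latest_per_key and the sample columns request_id and status are hypothetical; only _hdr_time_in_ms comes from the post.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def latest_per_key(rows: Iterable[Row], keys: List[str], sequence_by: str) -> List[Row]:
    """Keep, for each distinct key tuple, the row with the highest sequence value."""
    latest: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        current = latest.get(key)
        # Replace the stored row only when the incoming row is strictly newer.
        if current is None or row[sequence_by] > current[sequence_by]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample rows; request_id and status are made-up column names.
rows = [
    {"request_id": 1, "status": "open", "_hdr_time_in_ms": 100},
    {"request_id": 1, "status": "closed", "_hdr_time_in_ms": 200},
    {"request_id": 2, "status": "open", "_hdr_time_in_ms": 150},
    {"request_id": 1, "status": "open", "_hdr_time_in_ms": 100},  # exact duplicate
]
result = latest_per_key(rows, keys=["request_id"], sequence_by="_hdr_time_in_ms")
```

Because only one row per key is retained, the amount of state tracks the number of distinct keys rather than the number of incoming records, which is one plausible reason the approach in the post stays fast on incremental updates.
      -->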
      <pubDate>Wed, 18 Jan 2023 19:33:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14067#M8619</guid>
      <dc:creator>TomRenish</dc:creator>
      <dc:date>2023-01-18T19:33:47Z</dc:date>
    </item>
    <item>
      <title>Re: How to use foreachbatch in deltalivetable or DLT?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14068#M8620</link>
      <description>&lt;P&gt;You were running out of memory using .dropDuplicates on a stream because you did not specify a streaming watermark. Without one, the deduplication state grows without bound. The watermark defines a threshold past which late data can be ignored, so the state for that time frame no longer needs to be kept.&lt;/P&gt;</description>
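      <!--
The effect of a watermark on de-duplication state can be illustrated with a small plain-Python simulation. This is a toy model under stated assumptions, not how Spark implements it: the watermark is taken to be the largest event time seen so far minus a fixed delay, events older than the watermark are dropped, and keys first seen before the watermark are evicted from state. The function name dedupe_with_watermark and the synthetic events are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (key, event time in seconds)


def dedupe_with_watermark(events: Iterable[Event], delay: int) -> Tuple[List[Event], int]:
    """Return the de-duplicated events and the peak number of keys held in state."""
    seen: Dict[str, int] = {}  # key mapped to the event time at which it was first seen
    emitted: List[Event] = []
    max_event_time = 0
    peak_state = 0
    for key, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - delay
        if event_time < watermark:
            continue  # later than the threshold allows, so the event is ignored
        if key not in seen:
            seen[key] = event_time
            emitted.append((key, event_time))
        # Evict keys that fall behind the watermark; their state is no longer needed.
        for stale in [k for k, t in seen.items() if t < watermark]:
            del seen[stale]
        peak_state = max(peak_state, len(seen))
    return emitted, peak_state


# 1000 distinct keys, each delivered twice, one new key per second.
events = [(f"k{i}", i) for i in range(1000) for _ in range(2)]
bounded = dedupe_with_watermark(events, delay=10)
# A delay far larger than the data span behaves like having no watermark at all.
unbounded = dedupe_with_watermark(events, delay=10**9)
```

Both runs emit the same 1000 distinct events, but the bounded run holds only a handful of keys at any moment while the unbounded run ends up holding all of them. In PySpark the corresponding pattern is to call withWatermark on an event-time column before dropDuplicates; the column name and delay to use depend on your data.
      -->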
      <pubDate>Thu, 22 Jun 2023 02:49:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-use-foreachbatch-in-deltalivetable-or-dlt/m-p/14068#M8620</guid>
      <dc:creator>JohnA</dc:creator>
      <dc:date>2023-06-22T02:49:50Z</dc:date>
    </item>
  </channel>
</rss>

