<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>OOM Issue in Streaming with foreachBatch() - Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71493#M34326</link>
    <description>Databricks Community thread: driver memory in a stateless Structured Streaming job using foreachBatch() grows steadily over several days until the driver hits an out-of-memory (OOM) error.</description>
    <pubDate>Mon, 03 Jun 2024 15:56:45 GMT</pubDate>
    <dc:creator>dzsuzs</dc:creator>
    <dc:date>2024-06-03T15:56:45Z</dc:date>
    <item>
      <title>OOM Issue in Streaming with foreachBatch()</title>
      <link>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71493#M34326</link>
      <description>&lt;DIV&gt;I have a stateless streaming application that uses foreachBatch. This function executes between 10 and 400 times per hour based on custom logic.&lt;/DIV&gt;&lt;DIV&gt;The logic within foreachBatch includes:&lt;/DIV&gt;&lt;DIV&gt;&lt;OL&gt;&lt;LI&gt;collect() on very small DataFrames (a few megabytes); driver memory is more than 20 GB, so this shouldn't be an issue&lt;/LI&gt;&lt;LI&gt;Caching DataFrames and then unpersisting them&lt;/LI&gt;&lt;LI&gt;Converting a single row to a DataFrame&lt;/LI&gt;&lt;LI&gt;Performing a cross join on a very small DataFrame&lt;/LI&gt;&lt;LI&gt;Various filtering operations&lt;/LI&gt;&lt;LI&gt;Writing the DataFrame to the target_table in append mode&lt;/LI&gt;&lt;/OL&gt;&lt;/DIV&gt;&lt;DIV&gt;Driver memory usage gradually increases over a few days until the job eventually fails with a driver out-of-memory (OOM) error.&lt;/DIV&gt;&lt;DIV&gt;&lt;UL&gt;&lt;LI&gt;When does Spark remove state from the driver metadata in a streaming application? Are there configurations to force more aggressive cleanup?&lt;/LI&gt;&lt;LI&gt;Could frequent calls to collect() on small DataFrames still cause driver OOM issues? What alternatives can I use?&lt;/LI&gt;&lt;LI&gt;Should I avoid caching DataFrames even if they are used multiple times within a microbatch? How can I optimize the caching strategy?&lt;/LI&gt;&lt;LI&gt;Are there specific configurations or practices to better manage driver metadata and prevent memory bloat?&lt;/LI&gt;&lt;/UL&gt;&lt;/DIV&gt;&lt;DIV&gt;The goal is to manage and optimize driver memory usage effectively.&lt;/DIV&gt;&lt;DIV&gt;I look forward to your suggestions and insights on resolving this issue.&lt;/DIV&gt;</description>
      <pubDate>Mon, 03 Jun 2024 15:56:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71493#M34326</guid>
      <dc:creator>dzsuzs</dc:creator>
      <dc:date>2024-06-03T15:56:45Z</dc:date>
    </item>
    <item>
      <title>Re: OOM Issue in Streaming with foreachBatch()</title>
      <link>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71520#M34337</link>
      <description>&lt;P&gt;From the information you provided, your issue might be resolved by setting a watermark on the streaming DataFrame. The purpose of a watermark is to set a maximum time for records to be retained in state. Without a watermark, records in your state accumulate in memory, eventually resulting in an OOM error. Your job can also suffer other performance degradation as state accumulates over time.&lt;/P&gt;&lt;P&gt;In your case, assuming it's not necessary to retain all records in state over the lifetime of the job, you should set a reasonable window for records to be removed from state. For example, you could apply a 10-minute watermark like this:&lt;/P&gt;&lt;P&gt;`df.withWatermark("event_time", "10 minutes")`&lt;/P&gt;&lt;P&gt;Please refer to this Databricks documentation article on watermarks, including code examples: &lt;A href="https://docs.databricks.com/en/structured-streaming/watermarks.html" target="_blank"&gt;https://docs.databricks.com/en/structured-streaming/watermarks.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2024 19:33:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71520#M34337</guid>
      <dc:creator>xorbix_rshiva</dc:creator>
      <dc:date>2024-06-03T19:33:27Z</dc:date>
    </item>
    <item>
      <title>Re: OOM Issue in Streaming with foreachBatch()</title>
      <link>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71527#M34339</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/94141"&gt;@xorbix_rshiva&lt;/a&gt;&amp;nbsp;thanks for the reply! The streaming app does not keep any state (it only uses foreachBatch), so a watermark is unfortunately irrelevant and not the solution here.&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jun 2024 21:28:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/71527#M34339</guid>
      <dc:creator>dzsuzs</dc:creator>
      <dc:date>2024-06-03T21:28:34Z</dc:date>
    </item>
    <item>
      <title>Re: OOM Issue in Streaming with foreachBatch()</title>
      <link>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/110888#M43729</link>
      <description>&lt;P&gt;Did you ever figure out what is causing the memory leak? We are experiencing a nearly identical issue where memory gradually increases over time and the driver OOMs after a few days.&lt;BR /&gt;&lt;BR /&gt;I did track down this open bug ticket, which reports a memory leak when a dataset is persisted, even if it is later unpersisted:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-35262" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-35262&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 21 Feb 2025 16:41:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oom-issue-in-streaming-with-foreachbatch/m-p/110888#M43729</guid>
      <dc:creator>gardnmi1983</dc:creator>
      <dc:date>2025-02-21T16:41:32Z</dc:date>
    </item>
  </channel>
</rss>

