<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded - Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38316#M5510</link>
    <description>Discussion thread from the Databricks Community "Get Started Discussions" board: iterative reads and writes causing java.lang.OutOfMemoryError: GC overhead limit exceeded.</description>
    <pubDate>Tue, 25 Jul 2023 06:31:33 GMT</pubDate>
    <dc:creator>Chalki</dc:creator>
    <dc:date>2023-07-25T06:31:33Z</dc:date>
    <item>
      <title>Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded</title>
      <link>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38314#M5509</link>
      <description>&lt;P&gt;I have an iterative algorithm which reads and writes a dataframe while iterating through a list of new partitions, like this:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;for p in partitions_list:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    df = spark.read.parquet(f"adls_storage/{p}")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    (df.write.format("delta")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .mode("overwrite")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .option("partitionOverwriteMode", "dynamic")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .saveAsTable("schema.my_delta_table"))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The maximum partition data size is 2 TB overall. The job often succeeds only after the 4th rerun of the pipeline; very often it fails with GC overhead limit exceeded. In the standard output I also observe many GC allocation failures. Please check the screenshot.&lt;/P&gt;&lt;P&gt;It looks like the execution plans of the previous dataframes stay in the memory of the driver. Is this so?&lt;BR /&gt;Is there a way to purge them after each iteration?&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jul 2023 06:22:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38314#M5509</guid>
      <dc:creator>Chalki</dc:creator>
      <dc:date>2023-07-25T06:22:24Z</dc:date>
    </item>
    <item>
      <title>Re: Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded</title>
      <link>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38316#M5510</link>
      <description>&lt;P&gt;I forgot to mention that when creating the df I am using the filter method, because p is actually an object:&lt;/P&gt;&lt;P&gt;{cntr_id: 12, secure_key: 15, load_dates: [date1, date2, ...]}. The filter looks like:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df = (spark.read.parquet("adls_storage")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;      .where((col("cntr_id") == p["cntr_id"])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;             &amp;amp; (col("load_date").isin(p["load_dates"]))))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;
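&lt;P&gt;For completeness, the whole loop is then roughly this (just a sketch combining my two posts; the table name is the one from my first post):&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from pyspark.sql.functions import col&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;for p in partitions_list:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    # keep only this object's container id and load dates&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    df = (spark.read.parquet("adls_storage")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;          .where((col("cntr_id") == p["cntr_id"])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;                 &amp;amp; (col("load_date").isin(p["load_dates"]))))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    # dynamic partition overwrite into the delta table&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    (df.write.format("delta")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .mode("overwrite")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .option("partitionOverwriteMode", "dynamic")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .saveAsTable("schema.my_delta_table"))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>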
      <pubDate>Tue, 25 Jul 2023 06:31:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38316#M5510</guid>
      <dc:creator>Chalki</dc:creator>
      <dc:date>2023-07-25T06:31:33Z</dc:date>
    </item>
    <item>
      <title>Re: Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded</title>
      <link>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38443#M5511</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/62579"&gt;@Chalki&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;"GC Allocation Failure" is a little bit confusing - it simply indicates that GC kicked in because there was not enough memory left in the heap. That's normal, and you shouldn't worry about it.&lt;BR /&gt;&lt;BR /&gt;What is more worrying is "GC overhead limit exceeded": it means that the JVM spent too much time doing GC without reclaiming much memory.&lt;BR /&gt;&lt;BR /&gt;Without doing proper debugging of your code I would say - just scale up.&lt;/P&gt;
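&lt;P&gt;If you want to experiment before scaling up, one thing you could try is explicitly truncating per-iteration state so the driver doesn't hold on to old plans and cached data. A rough sketch, not tested on your workload (the checkpoint directory is a placeholder path):&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# placeholder path - any reliable storage location works&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;for p in partitions_list:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    df = spark.read.parquet(f"adls_storage/{p}")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    # eager checkpoint materializes df and truncates its logical plan&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    df = df.checkpoint(eager=True)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    (df.write.format("delta")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .mode("overwrite")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .option("partitionOverwriteMode", "dynamic")&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;       .saveAsTable("schema.my_delta_table"))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    # drop anything cached during this iteration before the next one&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;    spark.catalog.clearCache()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>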
      <pubDate>Wed, 26 Jul 2023 05:45:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38443#M5511</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2023-07-26T05:45:03Z</dc:date>
    </item>
    <item>
      <title>Re: Iterative read and writes cause java.lang.OutOfMemoryError: GC overhead limit exceeded</title>
      <link>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38455#M5512</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79106"&gt;@daniel_sahal&lt;/a&gt;&amp;nbsp;I've attached the wrong snip. Actually it is Full GC (Ergonomics) that was bothering me; I am attaching the correct snip now. But as you said, I scaled up a bit. The thing I forgot to mention is that the table is wide - more than 300 columns. I am not creating extra objects inside the loop except for the dataframe on each iteration, and it gets overwritten on the next one.&lt;BR /&gt;I still can't figure out how the memory builds up so much in the driver node. Could you give me some more details about it? For my own knowledge.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jul 2023 07:55:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/iterative-read-and-writes-cause-java-lang-outofmemoryerror-gc/m-p/38455#M5512</guid>
      <dc:creator>Chalki</dc:creator>
      <dc:date>2023-07-26T07:55:48Z</dc:date>
    </item>
  </channel>
</rss>