<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster Memory Issue (Termination) in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/71871#M9028</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Thank you for your answer.&amp;nbsp;It seems to me that the reply is a GPT-generated answer. I would expect an answer from a person in the community, as I have already tried to solve the issue with GPT.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Nevertheless:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1) Initial Memory Allocation:&amp;nbsp;&lt;/STRONG&gt;Adjusting the memory configuration might be a solution, but my question is how I should do that, and based on what metrics. What is the technical explanation of the issue and the solution?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2)&amp;nbsp;Memory Consumption with Dataframes:&amp;nbsp;&lt;/STRONG&gt;I am training ML models (Logistic Regression and LightGBM) with Optuna. PySpark does not provide the configuration options I need for these models and for hyperparameter optimization, so I must convert with toPandas() and use the scikit-learn and lightgbm libraries.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3)&amp;nbsp;GC (Allocation Failure):&amp;nbsp;&lt;/STRONG&gt;Could you please point me to documentation, a blog, a book, or any feature implementation covering these topics so I can understand the underlying issue?&lt;/P&gt;&lt;P&gt;After talking with the Databricks Core Team, I was first told that the problem is not memory but a networking issue:&lt;/P&gt;&lt;P&gt;"The network issue had caused the driver's IP to be out of reach, and hence, the Chauffeur assumed that the driver was dead, marked it as dead and restarted a new driver PID. 
Since a driver was restarted, the job failed and it should be temporary."&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is not temporary; it happens at irregular intervals.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-06-06 at 11.34.27.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8097i2CEAA03983BB6FB5/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Screenshot 2024-06-06 at 11.34.27.png" alt="Screenshot 2024-06-06 at 11.34.27.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For LightGBM training, these are the parameters I am trying with Optuna:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="egndz_0-1717670214321.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8098i2841AA587D72CA33/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="egndz_0-1717670214321.png" alt="egndz_0-1717670214321.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have seen that switching between n_jobs=1 and n_jobs=5 helped me reduce how often the error occurs in my trials. However, I have observed that with n_jobs=1, jobs on the smaller dataset (~150MB) finish faster than with n_jobs=5, where cross-validation should run in parallel and be faster, which is unexpected. 
When I set n_jobs to more than 1, the chance of seeing the error increases.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-06-06 at 11.37.17.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8099i8DCD2C125559EF0A/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Screenshot 2024-06-06 at 11.37.17.png" alt="Screenshot 2024-06-06 at 11.37.17.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I now believe the error comes from threading with Optuna and LightGBM (the same happens with Logistic Regression). I wonder whether the Optuna (3.5.0), lightgbm (4.3.0) and joblib (1.2.0) libraries are somehow causing the problem in the runtime. I still see GC messages during the runs, which I expect because I am using the &lt;A href="https://optuna.readthedocs.io/en/stable/_modules/optuna/study/study.html#Study.optimize" target="_self"&gt;Optuna.study.optimize&lt;/A&gt; function with&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;gc_after_trial=True.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-06-06 at 11.44.48.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8100i176A7F2A3E709529/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Screenshot 2024-06-06 at 11.44.48.png" alt="Screenshot 2024-06-06 at 11.44.48.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I would really appreciate it if someone in the community has an answer for this. I am willing to have a meeting and talk with anyone at this point.&lt;/P&gt;&lt;P&gt;Thanks!&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 06 Jun 2024 10:49:14 GMT</pubDate>
    <dc:creator>egndz</dc:creator>
    <dc:date>2024-06-06T10:49:14Z</dc:date>
    <item>
      <title>Cluster Memory Issue (Termination)</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/66076#M9026</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a single-node personal cluster with 56GB of memory (node type: Standard_DS5_v2, runtime: 14.3 LTS ML). The job cluster uses the same configuration, and the following problem applies to both clusters:&lt;/P&gt;&lt;P&gt;To start with: once I start my cluster without attaching anything, memory allocation is already high: 18 GB is used and 4.1 GB is cached. Is all of that just Spark, Python, and my libraries? Is there a way to reduce it, as it is about 40% of my total memory?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="egndz_2-1712845742934.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/7023iF34EA3EC6819CEBA/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="egndz_2-1712845742934.png" alt="egndz_2-1712845742934.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I use a .whl file to install my Python libraries. The same libraries in my local development virtual environment (Python 3.10) take 6.1GB of space.&amp;nbsp;&lt;/P&gt;&lt;P&gt;For my job, I run the following piece of code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;train_index = spark.table("my_train_index_table")
test_index = spark.table("my_test_index_table")

abt_table = spark.table("my_abt_table").where('some_column is not null')
abt_table = abt_table.select(*cols_to_select)

train_pdf = abt_table.join(train_index, on=["index_col"], how="inner").toPandas()
test_pdf = abt_table.join(test_index, on=["index_col"], how="inner").toPandas()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My tables are all Delta tables, and their sizes (from the Catalog Explorer) are:&lt;/P&gt;&lt;P&gt;my_train_index_table: 3.4MB - partition:1&lt;/P&gt;&lt;P&gt;my_test_index_table: 870KB - partition:1&lt;/P&gt;&lt;P&gt;my_abt_table: 3.8GB - partition: 40&amp;nbsp;&lt;/P&gt;&lt;P&gt;my_abt_table on pandas after where clause: 5.5GB. This is for analysis purposes; I don't convert this Spark DataFrame to pandas.&lt;/P&gt;&lt;P&gt;my_abt_table on pandas after column selection (lots of String Type): 2.7GB.&amp;nbsp;This is for analysis purposes; I don't convert this Spark DataFrame to pandas.&lt;/P&gt;&lt;P&gt;---&amp;nbsp;&lt;/P&gt;&lt;P&gt;After running the above code cell, 2 pandas frames are created:&lt;/P&gt;&lt;P&gt;train_pdf is 495 MB&lt;/P&gt;&lt;P&gt;test_pdf is 123.7 MB&lt;/P&gt;&lt;P&gt;At this point, when I look at the driver logs, I see GC (Allocation Failure) messages.&lt;/P&gt;&lt;P&gt;My driver info is as follows:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="egndz_1-1712845616736.png" style="width: 810px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/7022i68BF66C06CAB2CFA/image-dimensions/810x85/is-moderation-mode/true?v=v2" width="810" height="85" role="button" title="egndz_1-1712845616736.png" alt="egndz_1-1712845616736.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Peak heap memory is 29GB, which I can't make sense of in this case.&lt;/P&gt;&lt;P&gt;I tried the following solutions both individually and combined:&lt;/P&gt;&lt;P&gt;1) As Arrow is enabled on my cluster, I added the `&lt;SPAN&gt;spark.sql.execution.arrow.pyspark.selfDestruct.enabled True` config to my cluster to free memory during the toPandas() conversion, as described &lt;A href="https://issues.apache.org/jira/browse/SPARK-32953" 
target="_self"&gt;here.&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;2) Based on &lt;A href="https://www.databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html" target="_self"&gt;this blog&lt;/A&gt;, I tried G1GC for garbage collection with `&lt;SPAN&gt;-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20` and ended up with GC (Allocation Failure) again.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Based on my trials, it looks like something is preventing the GC from freeing memory, so eventually I get:&lt;BR /&gt;`The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.`&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;My two main questions are:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;1) Why is the initial memory usage equal to 40% of my total memory? Is it Spark, Python and my libraries?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2) With my train_pdf and test_pdf, I would expect roughly `initial memory consumption + my 2 dataframes`, which should be 18.6GB (used) + 4.1GB (cached) + 620MB (pandas dataframes), 23.3GB in total. Instead, I end up with 46.2GB (used) + 800MB (cached), 47GB in total. How is this possible?&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Is there anything I am missing here? This is a huge blocker for me now.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Apr 2024 14:58:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/66076#M9026</guid>
      <dc:creator>egndz</dc:creator>
      <dc:date>2024-04-11T14:58:53Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster Memory Issue (Termination)</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/71871#M9028</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;Thank you for your answer.&amp;nbsp;It seems to me that the reply is a GPT-generated answer. I would expect an answer from a person in the community, as I have already tried to solve the issue with GPT.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Nevertheless:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1) Initial Memory Allocation:&amp;nbsp;&lt;/STRONG&gt;Adjusting the memory configuration might be a solution, but my question is how I should do that, and based on what metrics. What is the technical explanation of the issue and the solution?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2)&amp;nbsp;Memory Consumption with Dataframes:&amp;nbsp;&lt;/STRONG&gt;I am training ML models (Logistic Regression and LightGBM) with Optuna. PySpark does not provide the configuration options I need for these models and for hyperparameter optimization, so I must convert with toPandas() and use the scikit-learn and lightgbm libraries.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3)&amp;nbsp;GC (Allocation Failure):&amp;nbsp;&lt;/STRONG&gt;Could you please point me to documentation, a blog, a book, or any feature implementation covering these topics so I can understand the underlying issue?&lt;/P&gt;&lt;P&gt;After talking with the Databricks Core Team, I was first told that the problem is not memory but a networking issue:&lt;/P&gt;&lt;P&gt;"The network issue had caused the driver's IP to be out of reach, and hence, the Chauffeur assumed that the driver was dead, marked it as dead and restarted a new driver PID. 
Since a driver was restarted, the job failed and it should be temporary."&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The problem is not temporary; it happens at irregular intervals.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-06-06 at 11.34.27.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8097i2CEAA03983BB6FB5/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Screenshot 2024-06-06 at 11.34.27.png" alt="Screenshot 2024-06-06 at 11.34.27.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For LightGBM training, these are the parameters I am trying with Optuna:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="egndz_0-1717670214321.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8098i2841AA587D72CA33/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="egndz_0-1717670214321.png" alt="egndz_0-1717670214321.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have seen that switching between n_jobs=1 and n_jobs=5 helped me reduce how often the error occurs in my trials. However, I have observed that with n_jobs=1, jobs on the smaller dataset (~150MB) finish faster than with n_jobs=5, where cross-validation should run in parallel and be faster, which is unexpected. 
When I set n_jobs to more than 1, the chance of seeing the error increases.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-06-06 at 11.37.17.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8099i8DCD2C125559EF0A/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Screenshot 2024-06-06 at 11.37.17.png" alt="Screenshot 2024-06-06 at 11.37.17.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I now believe the error comes from threading with Optuna and LightGBM (the same happens with Logistic Regression). I wonder whether the Optuna (3.5.0), lightgbm (4.3.0) and joblib (1.2.0) libraries are somehow causing the problem in the runtime. I still see GC messages during the runs, which I expect because I am using the &lt;A href="https://optuna.readthedocs.io/en/stable/_modules/optuna/study/study.html#Study.optimize" target="_self"&gt;Optuna.study.optimize&lt;/A&gt; function with&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;gc_after_trial=True.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-06-06 at 11.44.48.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/8100i176A7F2A3E709529/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Screenshot 2024-06-06 at 11.44.48.png" alt="Screenshot 2024-06-06 at 11.44.48.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I would really appreciate it if someone in the community has an answer for this. I am willing to have a meeting and talk with anyone at this point.&lt;/P&gt;&lt;P&gt;Thanks!&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Jun 2024 10:49:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/71871#M9028</guid>
      <dc:creator>egndz</dc:creator>
      <dc:date>2024-06-06T10:49:14Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster Memory Issue (Termination)</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/103703#M9030</link>
      <description>&lt;P&gt;Did you find a solution for this?&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 13:56:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-memory-issue-termination/m-p/103703#M9030</guid>
      <dc:creator>dataismypassion</dc:creator>
      <dc:date>2024-12-31T13:56:41Z</dc:date>
    </item>
  </channel>
</rss>

