<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: How to proactively monitor the use of the cache for driver node? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31765#M23142</link>
    <description>&lt;P&gt;There is a &lt;A href="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/util/SizeEstimator.html" alt="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/util/SizeEstimator.html" target="_blank"&gt;size estimator&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;But this is only an estimate, so its reliability may vary.&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/49492463/compute-size-of-spark-dataframe-sizeestimator-gives-unexpected-results" alt="https://stackoverflow.com/questions/49492463/compute-size-of-spark-dataframe-sizeestimator-gives-unexpected-results" target="_blank"&gt;Here&lt;/A&gt; is an option you can use, but performance-wise it is suboptimal (as you have to cache the DataFrame first).&lt;/P&gt;</description>
    <pubDate>Fri, 14 Jan 2022 08:22:15 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-01-14T08:22:15Z</dc:date>
    <item>
      <title>How to proactively monitor the use of the cache for driver node?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31763#M23140</link>
      <description>&lt;P&gt;&lt;B&gt;The problem:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;We have a DataFrame based on the query:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT *
FROM Very_Big_Table&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This table returns over 4 GB of data, and when we try to push the data to Power BI we get the error message:&lt;/P&gt;&lt;P&gt;ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code: '0' error message: 'Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 87 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.'.&lt;/P&gt;&lt;P&gt;To deal with this error, we've done the following:&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;We changed the cluster Spark configuration for the driver's maxResultSize to 10 GB -&amp;nbsp;&lt;B&gt;spark.driver.maxResultSize 10g&lt;/B&gt;. Now the data comes in perfectly.&lt;/P&gt;&lt;P&gt;2. We added a limitation on the data coming from Very_Big_Table (a WHERE clause to limit the data to the past 7 days).&lt;/P&gt;&lt;P&gt;&lt;B&gt;What do we want to achieve?&lt;/B&gt;&lt;/P&gt;&lt;P&gt;We want to be proactive about the process. To make sure this error doesn't happen again, we were thinking about an early warning: we want to know - in advance - when we are close to hitting the cache limit, so the refresh would happen smoothly; otherwise we would stop the refresh process and get some sort of notification that the size being pulled is too big. Or, if we see that the data we pull is close to the 10 GB limit, we could change the driver configuration before this happens, or limit the data pulled from the source table.&lt;/P&gt;&lt;P&gt;Is this information in the log?
Can we get the size of the DataFrame inside Databricks before we try to send it to Power BI, so the cache can accommodate the data?&lt;/P&gt;&lt;P&gt;Please let us know.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
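The "clearance warning" idea above could be sketched as follows. This is a minimal sketch, not a definitive implementation: reading the Catalyst optimizer's size estimate through `df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()` relies on a PySpark internal (unstable) accessor, the estimate can be far from the real serialized size, and the names `df` and `spark` are assumed to exist on a live Databricks cluster, so the Spark-specific lines are shown as comments.

```python
# Sketch of a proactive size check before pushing results to Power BI.
# The threshold logic is plain Python; the Spark calls (commented out)
# are assumptions to verify against your Databricks runtime.

def refresh_allowed(estimated_bytes: int, max_result_bytes: int,
                    safety_margin: float = 0.8) -> bool:
    """Allow the refresh only if the size estimate stays under a safety
    margin of spark.driver.maxResultSize (e.g. 80% of 10 GB)."""
    return estimated_bytes < max_result_bytes * safety_margin

# On a cluster (hypothetical usage, not runnable locally):
# df = spark.sql("SELECT * FROM Very_Big_Table")
# estimated = int(df._jdf.queryExecution().optimizedPlan()
#                   .stats().sizeInBytes())
# if not refresh_allowed(estimated, 10 * 1024**3):
#     raise RuntimeError("Estimated result is too close to "
#                        "spark.driver.maxResultSize; aborting refresh")
```

With a 10 GB limit and the default 80% margin, a 4 GB estimate would pass while a 9 GB estimate would trigger the warning.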
      <pubDate>Wed, 12 Jan 2022 22:40:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31763#M23140</guid>
      <dc:creator>Hila_DG</dc:creator>
      <dc:date>2022-01-12T22:40:56Z</dc:date>
    </item>
    <item>
      <title>Re: How to proactively monitor the use of the cache for driver node?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31764#M23141</link>
      <description>&lt;P&gt;@Hila Galapo​&amp;nbsp;- Welcome and thanks for your question! We'll give the community a chance to respond before we circle back.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Jan 2022 01:28:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31764#M23141</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-13T01:28:31Z</dc:date>
    </item>
    <item>
      <title>Re: How to proactively monitor the use of the cache for driver node?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31765#M23142</link>
      <description>&lt;P&gt;There is a &lt;A href="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/util/SizeEstimator.html" alt="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/util/SizeEstimator.html" target="_blank"&gt;size estimator&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;But this is only an estimate, so its reliability may vary.&lt;/P&gt;&lt;P&gt;&lt;A href="https://stackoverflow.com/questions/49492463/compute-size-of-spark-dataframe-sizeestimator-gives-unexpected-results" alt="https://stackoverflow.com/questions/49492463/compute-size-of-spark-dataframe-sizeestimator-gives-unexpected-results" target="_blank"&gt;Here&lt;/A&gt; is an option you can use, but performance-wise it is suboptimal (as you have to cache the DataFrame first).&lt;/P&gt;</description>
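The cache-then-measure option referenced above could be sketched like this. It is a sketch under stated assumptions: `SparkContext.getRDDStorageInfo` is a developer API on the JVM side reached here through the PySpark internal `_jsc` accessor, the names `df` and `spark` are assumed to exist on a live cluster, and caching the full DataFrame just to measure it is exactly the performance cost the answer warns about. The Spark calls are therefore shown as comments, with the summing logic as plain Python.

```python
# Sketch of the cache-then-measure option: materialize the DataFrame in
# the cache, then sum the reported storage sizes of the cached RDDs.

def cached_size_bytes(storage_infos) -> int:
    """Sum in-memory and on-disk bytes over (memSize, diskSize) pairs
    reported for cached RDDs."""
    return sum(mem + disk for mem, disk in storage_infos)

# On a cluster (hypothetical usage, not runnable locally):
# df.cache()
# df.count()  # action to actually materialize the cache
# infos = spark.sparkContext._jsc.sc().getRDDStorageInfo()
# total = cached_size_bytes((i.memSize(), i.diskSize()) for i in infos)
# df.unpersist()  # free the cache once measured
```

Note this measures the cached (often compressed, columnar) representation, which can differ noticeably from the serialized size counted against spark.driver.maxResultSize.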
      <pubDate>Fri, 14 Jan 2022 08:22:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31765#M23142</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-01-14T08:22:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to proactively monitor the use of the cache for driver node?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31766#M23143</link>
      <description>&lt;P&gt;As it is just a SELECT for a BI tool, I strongly recommend starting to use a serverless SQL endpoint. It is available in the Premium tier (you can always have two Azure workspaces, Standard and Premium, at the same time). In my opinion it is more stable, and also sometimes cheaper as you don't need VMs.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jan 2022 14:35:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31766#M23143</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-01-14T14:35:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to proactively monitor the use of the cache for driver node?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31767#M23144</link>
      <description>&lt;P&gt;@Hila Galapo​&amp;nbsp;- Do these answers help you? If yes, would you be happy to mark one as best so that other members can find the solution more quickly?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 16:27:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31767#M23144</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-26T16:27:12Z</dc:date>
    </item>
    <item>
      <title>Re: How to proactively monitor the use of the cache for driver node?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31768#M23145</link>
      <description>&lt;P&gt;Hey @Hila Galapo​&lt;/P&gt;&lt;P&gt;Hope everything is going well. Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 13 May 2022 12:23:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-proactively-monitor-the-use-of-the-cache-for-driver-node/m-p/31768#M23145</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-05-13T12:23:11Z</dc:date>
    </item>
  </channel>
</rss>