<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic After several iterations of filter and union, the data is bigger than spark.driver.maxResultSize in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/after-several-iteration-of-filter-and-union-the-data-is-bigger/m-p/14962#M9367</link>
    <description>&lt;P&gt;My process for building the model is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;filter the dataset and split it into two datasets&lt;/LI&gt;&lt;LI&gt;fit a model on each of the two datasets&lt;/LI&gt;&lt;LI&gt;union the two datasets&lt;/LI&gt;&lt;LI&gt;repeat steps 1-3&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The problem is that after several iterations, the model fitting time increases dramatically, and I get the error message: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 9587 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB). Yet the number of columns and rows in the data stays the same.&lt;/P&gt;&lt;P&gt;Since the model fitting time also keeps growing, I don't think increasing spark.driver.maxResultSize will solve this problem. Any suggestions? Thanks.&lt;/P&gt;</description>
    <pubDate>Wed, 22 Sep 2021 19:36:52 GMT</pubDate>
    <dc:creator>Geeya</dc:creator>
    <dc:date>2021-09-22T19:36:52Z</dc:date>
    <item>
      <title>After several iterations of filter and union, the data is bigger than spark.driver.maxResultSize</title>
      <link>https://community.databricks.com/t5/data-engineering/after-several-iteration-of-filter-and-union-the-data-is-bigger/m-p/14962#M9367</link>
      <description>&lt;P&gt;My process for building the model is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;filter the dataset and split it into two datasets&lt;/LI&gt;&lt;LI&gt;fit a model on each of the two datasets&lt;/LI&gt;&lt;LI&gt;union the two datasets&lt;/LI&gt;&lt;LI&gt;repeat steps 1-3&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The problem is that after several iterations, the model fitting time increases dramatically, and I get the error message: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 9587 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB). Yet the number of columns and rows in the data stays the same.&lt;/P&gt;&lt;P&gt;Since the model fitting time also keeps growing, I don't think increasing spark.driver.maxResultSize will solve this problem. Any suggestions? Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Sep 2021 19:36:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/after-several-iteration-of-filter-and-union-the-data-is-bigger/m-p/14962#M9367</guid>
      <dc:creator>Geeya</dc:creator>
      <dc:date>2021-09-22T19:36:52Z</dc:date>
    </item>
    <item>
      <title>Re: After several iterations of filter and union, the data is bigger than spark.driver.maxResultSize</title>
      <link>https://community.databricks.com/t5/data-engineering/after-several-iteration-of-filter-and-union-the-data-is-bigger/m-p/14963#M9368</link>
      <description>&lt;P&gt;I assume you are using PySpark to train a model? It sounds like you are collecting data on the driver and likely need to increase that limit. Can you share any code?&lt;/P&gt;</description>
      <pubDate>Wed, 22 Sep 2021 20:11:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/after-several-iteration-of-filter-and-union-the-data-is-bigger/m-p/14963#M9368</guid>
      <dc:creator>Ryan_Chynoweth</dc:creator>
      <dc:date>2021-09-22T20:11:44Z</dc:date>
    </item>
  </channel>
</rss>