<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic AutoMl Dataset too large in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/63557#M3104</link>
    <description>&lt;P&gt;Hello community,&lt;/P&gt;&lt;P&gt;I have the following problem: I am using AutoML to train a regression model, but during preprocessing my dataset is sampled down to ~30% of its original size.&lt;/P&gt;&lt;P&gt;I am using runtime 14.2 ML.&lt;/P&gt;&lt;P&gt;Driver: Standard_DS4_v2, 28 GB memory, 8 cores&lt;/P&gt;&lt;P&gt;Worker: Standard_DS4_v2, 28 GB memory, 8 cores (min 1, max 2)&lt;/P&gt;&lt;P&gt;I already set spark.task.cpus = 8, but my dataset is still downsampled 😞&lt;/P&gt;&lt;DIV&gt;The catalog reports the following size for my table:&lt;/DIV&gt;&lt;DIV&gt;Size: 264.5 MiB, 8 files&lt;/DIV&gt;&lt;DIV&gt;I don't understand why it still doesn't fit.&lt;/DIV&gt;&lt;DIV&gt;Any help would be appreciated.&lt;/DIV&gt;&lt;DIV&gt;Mirko&lt;/DIV&gt;</description>
    <pubDate>Wed, 13 Mar 2024 13:58:03 GMT</pubDate>
    <dc:creator>Mirko</dc:creator>
    <dc:date>2024-03-13T13:58:03Z</dc:date>
    <item>
      <title>AutoMl Dataset too large</title>
      <link>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/63557#M3104</link>
      <description>&lt;P&gt;Hello community,&lt;/P&gt;&lt;P&gt;I have the following problem: I am using AutoML to train a regression model, but during preprocessing my dataset is sampled down to ~30% of its original size.&lt;/P&gt;&lt;P&gt;I am using runtime 14.2 ML.&lt;/P&gt;&lt;P&gt;Driver: Standard_DS4_v2, 28 GB memory, 8 cores&lt;/P&gt;&lt;P&gt;Worker: Standard_DS4_v2, 28 GB memory, 8 cores (min 1, max 2)&lt;/P&gt;&lt;P&gt;I already set spark.task.cpus = 8, but my dataset is still downsampled 😞&lt;/P&gt;&lt;DIV&gt;The catalog reports the following size for my table:&lt;/DIV&gt;&lt;DIV&gt;Size: 264.5 MiB, 8 files&lt;/DIV&gt;&lt;DIV&gt;I don't understand why it still doesn't fit.&lt;/DIV&gt;&lt;DIV&gt;Any help would be appreciated.&lt;/DIV&gt;&lt;DIV&gt;Mirko&lt;/DIV&gt;</description>
      <pubDate>Wed, 13 Mar 2024 13:58:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/63557#M3104</guid>
      <dc:creator>Mirko</dc:creator>
      <dc:date>2024-03-13T13:58:03Z</dc:date>
    </item>
    <item>
      <title>Re: AutoMl Dataset too large</title>
      <link>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/63784#M3118</link>
      <description>&lt;P&gt;Thank you for your detailed answer. I followed your suggestions, with the following results:&lt;/P&gt;&lt;P&gt;- Repartitioning the data didn't change anything.&lt;/P&gt;&lt;P&gt;- I checked the workers' metrics, and memory is indeed almost fully used (10 GB used, nearly 17 GB cached).&lt;/P&gt;&lt;P&gt;- I do not fully understand why my relatively small dataset creates such a large memory demand. It may be due to the number of categorical features: one-hot encoding could produce many extra columns.&lt;/P&gt;</description>
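      <content:encoded><![CDATA[
The suspicion about one-hot encoding can be checked with a rough back-of-the-envelope estimate. The sketch below is only an illustration: the toy rows, the column names, and the helper names `one_hot_widths` and `dense_size_bytes` are hypothetical, and it assumes the encoded data is held as a dense array of 8-byte values, which the thread does not confirm for AutoML.

```python
from typing import Dict, Iterable, List, Mapping


def one_hot_widths(rows: List[Mapping[str, object]],
                   categorical_columns: Iterable[str]) -> Dict[str, int]:
    """Count distinct values per column; each distinct value becomes one one-hot column."""
    return {col: len({row[col] for row in rows}) for col in categorical_columns}


def dense_size_bytes(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> int:
    """Size of a dense n_rows x n_columns matrix (assumption: 8-byte values)."""
    return n_rows * n_columns * bytes_per_value


# Hypothetical toy rows; the real table's columns are not shown in the thread.
rows = [
    {"color": "red", "city": "Berlin", "price": 1.0},
    {"color": "blue", "city": "Munich", "price": 2.0},
    {"color": "red", "city": "Hamburg", "price": 3.0},
]

widths = one_hot_widths(rows, ["color", "city"])
encoded_columns = sum(widths.values())
print(widths)
print(dense_size_bytes(len(rows), encoded_columns))
```

For scale, under the same dense assumption, a hypothetical one million rows with 10,000 encoded columns at 8 bytes each would come to 80 GB, far more than a 28 GB node can hold.
]]></content:encoded>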
      <pubDate>Fri, 15 Mar 2024 08:56:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/63784#M3118</guid>
      <dc:creator>Mirko</dc:creator>
      <dc:date>2024-03-15T08:56:25Z</dc:date>
    </item>
    <item>
      <title>Re: AutoMl Dataset too large</title>
      <link>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/64068#M3132</link>
      <description>&lt;P&gt;I am pretty sure that I know what the problem was. I had a timestamp column (with second precision) as a feature. If such a column gets one-hot encoded, the dataset can become very large.&lt;/P&gt;</description>
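      <content:encoded><![CDATA[
To illustrate why a second-precision timestamp is a poor candidate for one-hot encoding, here is a minimal sketch. The generated timestamps and the helper `timestamp_features` are hypothetical; deriving a few numeric features is one possible workaround, not something the thread confirms AutoML does or recommends.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def timestamp_features(ts: datetime) -> Dict[str, int]:
    """Replace one high-cardinality timestamp with a few low-cardinality numeric features."""
    return {"hour": ts.hour, "day_of_week": ts.weekday(), "month": ts.month}


# Hypothetical data: 1000 consecutive seconds, so every timestamp is unique.
start = datetime(2024, 3, 13, 13, 58, 3)
timestamps: List[datetime] = [start + timedelta(seconds=i) for i in range(1000)]

# One-hot encoding the raw column would create one column per distinct value.
one_hot_columns = len(set(timestamps))

# The derived features take far fewer distinct values.
features = [timestamp_features(ts) for ts in timestamps]
distinct_hours = {f["hour"] for f in features}

print(one_hot_columns)
print(sorted(distinct_hours))
```

With unique timestamps, the number of one-hot columns grows with the number of rows, while the derived features stay bounded (at most 24 hours, 7 weekdays, 12 months). Simply dropping the timestamp column from the feature set is another option if it carries no signal.
]]></content:encoded>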
      <pubDate>Tue, 19 Mar 2024 11:50:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/automl-dataset-too-large/m-p/64068#M3132</guid>
      <dc:creator>Mirko</dc:creator>
      <dc:date>2024-03-19T11:50:12Z</dc:date>
    </item>
  </channel>
</rss>

