<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Runtime issue in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/runtime-issue/m-p/53396#M2749</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am working on a machine learning project. The dataset I am using has more than 5,000,000 rows.&lt;/P&gt;&lt;P&gt;I am using PySpark, and the attached screenshot shows the block where I used RandomForestRegressor to train the model.&lt;/P&gt;&lt;P&gt;It worked the first time, although it took a long time. When I tried to run the same block again, it no longer worked. I let it run for a whole night, but it never started the Spark jobs and kept showing the message "Filtering files for query". I am using 10 features for the model, so I wonder whether the high dimensionality of the features is the cause. But even so, why does it not work now when it worked before?&lt;/P&gt;&lt;P&gt;I also tried a sample of 10% of the total data, but it still does not work. I tried using PCA to reduce the dimensionality as well, but that did not run either.&lt;/P&gt;&lt;P&gt;I tried to increase the number of worker nodes in the cluster, but that is not allowed because I am using the Azure Databricks free trial. The policy of my cluster is "Personal Compute". I am very new to the Databricks platform and am trying to figure out how to deal with these issues. I have searched and tried everything I could, but nothing seems to work. Can anyone please tell me whether there is a way to work with large data and train the model in less time, or at least offer suggestions for my situation?&lt;/P&gt;&lt;P&gt;I would very much appreciate your help!&lt;/P&gt;</description>
    <pubDate>Tue, 21 Nov 2023 21:33:55 GMT</pubDate>
    <dc:creator>choi_2</dc:creator>
    <dc:date>2023-11-21T21:33:55Z</dc:date>
    <item>
      <title>Runtime issue</title>
      <link>https://community.databricks.com/t5/machine-learning/runtime-issue/m-p/53396#M2749</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am working on a machine learning project. The dataset I am using has more than 5,000,000 rows.&lt;/P&gt;&lt;P&gt;I am using PySpark, and the attached screenshot shows the block where I used RandomForestRegressor to train the model.&lt;/P&gt;&lt;P&gt;It worked the first time, although it took a long time. When I tried to run the same block again, it no longer worked. I let it run for a whole night, but it never started the Spark jobs and kept showing the message "Filtering files for query". I am using 10 features for the model, so I wonder whether the high dimensionality of the features is the cause. But even so, why does it not work now when it worked before?&lt;/P&gt;&lt;P&gt;I also tried a sample of 10% of the total data, but it still does not work. I tried using PCA to reduce the dimensionality as well, but that did not run either.&lt;/P&gt;&lt;P&gt;I tried to increase the number of worker nodes in the cluster, but that is not allowed because I am using the Azure Databricks free trial. The policy of my cluster is "Personal Compute". I am very new to the Databricks platform and am trying to figure out how to deal with these issues. I have searched and tried everything I could, but nothing seems to work. Can anyone please tell me whether there is a way to work with large data and train the model in less time, or at least offer suggestions for my situation?&lt;/P&gt;&lt;P&gt;I would very much appreciate your help!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Nov 2023 21:33:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/runtime-issue/m-p/53396#M2749</guid>
      <dc:creator>choi_2</dc:creator>
      <dc:date>2023-11-21T21:33:55Z</dc:date>
    </item>
  </channel>
</rss>

