
Runtime issue

choi_2
New Contributor II

Hello,

I am working on a machine learning project. The dataset I am using has more than 5,000,000 rows.

I am using PySpark, and the attached screenshot shows the block where I use RandomForestRegressor to train the model.
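
A rough sketch of the kind of PySpark training block described here (placeholder column names and parameters; the actual code in the screenshot may differ):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    # Placeholder feature/label column names; the real dataset has 10 features.
    feature_cols = ["f1", "f2", "f3"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    train_df = assembler.transform(df)  # `df` is the full 5M+ row DataFrame
    rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=20)
    model = rf.fit(train_df)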

It worked the first time, although it took quite a long time. But when I tried to run the same part again, it no longer worked: I let it run for a whole night, but it never even started the Spark jobs and kept showing the message "Filtering files for query". I am using 10 features for the model, so I wondered whether the high dimensionality of the features was the cause, but even then, why would it fail now when it worked before?

I even tried a sample dataset containing 10% of the total data, but it still does not work. I also tried using PCA to reduce the dimensionality, but that did not run either.
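
For reference, a PySpark PCA step of the kind mentioned above would look roughly like this (placeholder column names; not the exact code that was run):

    from pyspark.ml.feature import PCA

    # Reduce the 10-dimensional assembled feature vector to, e.g., 5 components.
    pca = PCA(k=5, inputCol="features", outputCol="pca_features")
    pca_model = pca.fit(train_df)
    reduced_df = pca_model.transform(train_df)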

I tried to increase the number of worker nodes in the cluster, but that is not allowed because I am on an Azure Databricks free trial. The policy of my cluster is "Personal Compute". I am very new to the Databricks platform, and I am trying to figure out how to deal with these issues. I have searched and tried everything I could, but nothing seems to work. Can anyone please tell me if there is any way to work with large data and train the model in less time, or at least offer suggestions for my situation?

I would very much appreciate your help!

ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @choi_2, it sounds like you’re dealing with some challenges while working on your machine learning project.

 

Handling large datasets can indeed be tricky, especially when memory constraints and processing time come into play. 

 

Let’s explore some strategies to address your situation:

 

Allocate More Memory:

  • Some machine learning tools or libraries may be limited by default memory configurations. Check if you can reconfigure your tool or library to allocate more memory. For instance, in Weka, you can increase memory as a parameter when starting the application.
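
In the Databricks/PySpark setting of your question, the equivalent knobs are the driver and executor memory settings, which are fixed in the cluster's Spark config before the cluster starts. A minimal sketch (the values are only examples, and whether they fit depends on your node type):

    # Inspect the current settings from a notebook (read-only at runtime).
    conf = spark.sparkContext.getConf()
    print(conf.get("spark.executor.memory", "cluster default"))
    print(conf.get("spark.driver.memory", "cluster default"))

    # Example lines to place in the cluster's Spark config UI (assumption: the
    # chosen node type has enough RAM to back these values):
    #   spark.executor.memory 8g
    #   spark.driver.memory 8g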

Work with a Smaller Sample:

  • Consider whether you truly need to work with the entire dataset. Taking a random sample (e.g., 1,000 or 100,000 rows, or a small fraction of the data) can help you explore algorithms and check results quickly. Once you’re satisfied, you can fit the final model using all the data (using progressive data loading techniques).
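
In PySpark this can be a one-liner; a sketch, assuming `df` is your full DataFrame:

    # Random ~1% sample for quick experiments; cache it so repeated model runs
    # do not re-read and re-filter the source files each time.
    sample_df = df.sample(fraction=0.01, seed=42).cache()
    sample_df.count()  # materialize the cache once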

Use a Computer with More Memory:

  • If possible, access a larger computer with significantly more memory. Cloud services like Amazon Web Services (AWS) offer machines with tens of gigabytes of RAM for a reasonable cost. This approach can be beneficial for large-scale computations.

Change the Data Format:

  • If your data is stored in raw ASCII text (e.g., CSV files), consider using a more memory-efficient format. Binary formats like GRIB, NetCDF, or HDF can speed up data loading and reduce memory usage. You can transform data formats using command-line tools without loading the entire dataset into memory.
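
In a Spark/Databricks context the natural equivalents are columnar formats such as Parquet or Delta rather than GRIB/NetCDF/HDF; a sketch of a one-time conversion (paths are placeholders):

    # One-time conversion from CSV to Delta; later reads are faster because only
    # the needed columns and files are scanned.
    raw = spark.read.csv("/mnt/data/raw.csv", header=True, inferSchema=True)
    raw.write.format("delta").mode("overwrite").save("/mnt/data/training_delta")

    train_df = spark.read.format("delta").load("/mnt/data/training_delta")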

Stream Data or Use Progressive Loading:

  • Explore algorithms that can learn iteratively using optimization techniques (e.g., stochastic gradient descent). These methods allow you to stream or progressively load data as needed for training. Linear regression and logistic regression implementations often require all data in memory, but other algorithms can handle progressive loading.
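
As a single-machine illustration of the idea (scikit-learn rather than Spark; the path and column names are placeholders):

    import pandas as pd
    from sklearn.linear_model import SGDRegressor

    # Train incrementally, one 100,000-row chunk at a time, so the full CSV
    # never has to fit in memory at once.
    model = SGDRegressor()
    feature_cols = ["f1", "f2", "f3"]  # placeholder names
    for chunk in pd.read_csv("/dbfs/mnt/data/raw.csv", chunksize=100_000):
        model.partial_fit(chunk[feature_cols], chunk["label"])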

Remember that each situation is unique, and the best approach depends on your specific requirements and constraints. Experiment with these suggestions and see which one works best for your project. 

 

Good luck, and feel free to ask if you need further assistance! 🚀🔍
