Hello,
I am working on a machine learning project. The dataset I am using has more than 5,000,000 rows.
I am using PySpark, and the attached screenshot shows the block where I use RandomForestRegressor to train the model.
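In case the screenshot does not come through, here is a simplified sketch of what that block looks like (the feature and label column names are just placeholders for my real ones):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Assemble the 10 feature columns into a single vector column
# (column names are placeholders for my actual columns)
assembler = VectorAssembler(
    inputCols=[f"feature_{i}" for i in range(1, 11)],
    outputCol="features",
)
train_df = assembler.transform(df).select("features", "label")

# Train the random forest regressor
rf = RandomForestRegressor(featuresCol="features", labelCol="label")
model = rf.fit(train_df)
```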
It worked the first time, although it took quite a long time. But when I try to run the same block again, it no longer works. I even let it run for a whole night, but the Spark jobs never started and the cell kept showing the message "Filtering files for query". I am using 10 features for the model, so I wondered whether the dimensionality of the features is the cause, but even then, why would it fail now when it worked before?
I also tried a sample dataset containing 10% of the total data, but it still does not work. I tried using PCA to reduce the dimensionality as well, but that did not finish either.
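For reference, this is roughly what I did for the sampling and the PCA attempt (simplified; the 10% fraction comes from my test, but the seed and the choice of k=5 components are just example values):

```python
from pyspark.ml.feature import PCA

# Take a 10% random sample of the full dataset
sample_df = df.sample(fraction=0.1, seed=42)

# Try to reduce the 10 assembled features to fewer components
pca = PCA(k=5, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(train_df)  # this fit never finishes for me
reduced_df = pca_model.transform(train_df)
```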
I tried to increase the number of worker nodes in the cluster, but that is not allowed because I am on the Azure Databricks free trial; the policy of my cluster is "Personal Compute". I am very new to the Databricks platform and am trying to figure out how to deal with these issues. I have searched and tried everything I could, but nothing seems to work. Can anyone tell me whether there is a way to work with this much data and train the model in less time, or at least offer any suggestions for my situation?
I would really appreciate your help!