
Runtime issue

choi_2
New Contributor II

Hello,

I am working on a machine learning project. The dataset I am using has more than 5,000,000 rows.

I am using PySpark, and the attached screenshot shows the block where I use RandomForestRegressor to train the model.
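
A rough sketch of the kind of PySpark training block described here (placeholder column names and parameters; the actual code in the screenshot may differ):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    # Placeholder feature/label column names; the real dataset has 10 features.
    feature_cols = ["f1", "f2", "f3"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    train_df = assembler.transform(df)  # `df` is the full 5M+ row DataFrame
    rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=20)
    model = rf.fit(train_df)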

It worked the first time, although it took quite a long time. But when I tried to run the same part again, it no longer worked: I let it run for a whole night, but it never even started the Spark jobs and kept showing the message "Filtering files for query". I am using 10 features for the model, so I wondered whether the high dimensionality of the features was the cause, but even then, why would it fail now when it worked before?

I even tried a sample dataset containing 10% of the total data, but it still does not work. I also tried using PCA to reduce the dimensionality, but that did not run either.
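
For reference, a PySpark PCA step of the kind mentioned above would look roughly like this (placeholder column names; not the exact code that was run):

    from pyspark.ml.feature import PCA

    # Reduce the 10-dimensional assembled feature vector to, e.g., 5 components.
    pca = PCA(k=5, inputCol="features", outputCol="pca_features")
    pca_model = pca.fit(train_df)
    reduced_df = pca_model.transform(train_df)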

I tried to increase the number of worker nodes in the cluster, but that is not allowed because I am on an Azure Databricks free trial. The policy of my cluster is "Personal Compute". I am very new to the Databricks platform, and I am trying to figure out how to deal with these issues. I have searched and tried everything I could, but nothing seems to work. Can anyone please tell me if there is any way to work with large data and train the model in less time, or at least offer suggestions for my situation?

I would very much appreciate your help!

ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @choi_2, it sounds like you’re dealing with some challenges while working on your machine learning project.

 

Handling large datasets can indeed be tricky, especially when memory constraints and processing time come into play. 

 

Let’s explore some strategies to address your situation:

 

Allocate More Memory:

  • Some machine learning tools or libraries may be limited by default memory configurations. Check if you can reconfigure your tool or library to allocate more memory. For instance, in Weka, you can increase memory as a parameter when starting the application.
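
In the Databricks/PySpark setting of your question, the equivalent knobs are the driver and executor memory settings, which are fixed in the cluster's Spark config before the cluster starts. A minimal sketch (the values are only examples, and whether they fit depends on your node type):

    # Inspect the current settings from a notebook (read-only at runtime).
    conf = spark.sparkContext.getConf()
    print(conf.get("spark.executor.memory", "cluster default"))
    print(conf.get("spark.driver.memory", "cluster default"))

    # Example lines to place in the cluster's Spark config UI (assumption: the
    # chosen node type has enough RAM to back these values):
    #   spark.executor.memory 8g
    #   spark.driver.memory 8g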

Work with a Smaller Sample:

  • Consider whether you truly need to work with the entire dataset. Taking a random sample (e.g., 1,000 or 100,000 rows, or a small fraction of the data) can help you explore algorithms and check results quickly. Once you’re satisfied, you can fit the final model using all the data (using progressive data loading techniques).
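
In PySpark this can be a one-liner; a sketch, assuming `df` is your full DataFrame:

    # Random ~1% sample for quick experiments; cache it so repeated model runs
    # do not re-read and re-filter the source files each time.
    sample_df = df.sample(fraction=0.01, seed=42).cache()
    sample_df.count()  # materialize the cache once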

Use a Computer with More Memory:

  • If possible, access a larger computer with significantly more memory. Cloud services like Amazon Web Services (AWS) offer machines with tens of gigabytes of RAM for a reasonable cost. This approach can be beneficial for large-scale computations.

Change the Data Format:

  • If your data is stored in raw ASCII text (e.g., CSV files), consider using a more memory-efficient format. Binary formats like GRIB, NetCDF, or HDF can speed up data loading and reduce memory usage. You can transform data formats using command-line tools without loading the entire dataset into memory.
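
In a Spark/Databricks context the natural equivalents are columnar formats such as Parquet or Delta rather than GRIB/NetCDF/HDF; a sketch of a one-time conversion (paths are placeholders):

    # One-time conversion from CSV to Delta; later reads are faster because only
    # the needed columns and files are scanned.
    raw = spark.read.csv("/mnt/data/raw.csv", header=True, inferSchema=True)
    raw.write.format("delta").mode("overwrite").save("/mnt/data/training_delta")

    train_df = spark.read.format("delta").load("/mnt/data/training_delta")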

Stream Data or Use Progressive Loading:

  • Explore algorithms that can learn iteratively using optimization techniques (e.g., stochastic gradient descent). These methods allow you to stream or progressively load data as needed for training. Linear regression and logistic regression implementations often require all data in memory, but other algorithms can handle progressive loading.
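
As a single-machine illustration of the idea (scikit-learn rather than Spark; the path and column names are placeholders):

    import pandas as pd
    from sklearn.linear_model import SGDRegressor

    # Train incrementally, one 100,000-row chunk at a time, so the full CSV
    # never has to fit in memory at once.
    model = SGDRegressor()
    feature_cols = ["f1", "f2", "f3"]  # placeholder names
    for chunk in pd.read_csv("/dbfs/mnt/data/raw.csv", chunksize=100_000):
        model.partial_fit(chunk[feature_cols], chunk["label"])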

Remember that each situation is unique, and the best approach depends on your specific requirements and constraints. Experiment with these suggestions and see which one works best for your project. 

 

Good luck, and feel free to ask if you need further assistance! 🚀🔍
