Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Why is Spark MLlib not encouraged on the platform? / Why is ML dependent on .toPandas() on Databricks?

stochastic
New Contributor

I'm new to Spark and Databricks, and I'm surprised that the Databricks ML tutorials favor pandas DataFrames over Spark DataFrames. In the tutorials I've seen, most of the data processing is done in a distributed manner, but the result is then simply converted to a pandas DataFrame. I was excited to use Databricks for faster processing and training, but it feels like I'm giving back the time I gain in preprocessing by eventually calling `.toPandas()` somewhere.

Tutorials:

- (https://docs.databricks.com/en/mlflow/end-to-end-example.html)

- Feature tables are computed in a distributed manner via the FE client, but to actually use them for training they are converted with `.toPandas()` (https://docs.databricks.com/en/_extras/notebooks/source/machine-learning/feature-store-with-uc-taxi-...)
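
To make the question concrete, here is a rough sketch of the two approaches being contrasted. It is not taken from either tutorial: the function names, the choice of a random forest, and the `feature_cols` / `label_col` arguments are illustrative assumptions, and `spark_df` stands for any Spark DataFrame with numeric feature and label columns.

```python
from sklearn.ensemble import RandomForestClassifier


def fit_on_driver(spark_df, feature_cols, label_col):
    """Collect a Spark DataFrame to the driver and fit scikit-learn there."""
    # .toPandas() pulls every selected row into driver memory, so from this
    # point on only the driver node is doing any work.
    pdf = spark_df.select(*feature_cols, label_col).toPandas()
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(pdf[feature_cols], pdf[label_col])
    return model


def fit_distributed(spark_df, feature_cols, label_col):
    """Fit a comparable Spark MLlib model without leaving the cluster."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier as SparkRF
    from pyspark.ml.feature import VectorAssembler

    # MLlib estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    rf = SparkRF(featuresCol="features", labelCol=label_col, numTrees=50, seed=0)
    return Pipeline(stages=[assembler, rf]).fit(spark_df)
```

The first function is the pattern the tutorials appear to follow. The second is what I expected to see instead, since the data never leaves the workers.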

 

Overall questions:

1. Am I fundamentally misunderstanding something about ML and Databricks?

2. In the case where we do use scikit-learn and single-node compute and don't really need all the distributed compute, what is the best driver/worker setup?

