Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Why is Spark MLlib not encouraged on the platform? / Why is ML dependent on .toPandas() on Databricks?

stochastic
New Contributor

I'm new to Spark and Databricks, and I'm surprised that the Databricks ML tutorials favor pandas DataFrames over Spark DataFrames. In the tutorials I've seen, most data processing is done in a distributed manner, but then the result is simply converted to a pandas DataFrame. I was excited to use Databricks for faster processing and training, but it feels like I'm trading away the time I gain in preprocessing by eventually calling `.toPandas()` somewhere.

Tutorials:

- (https://docs.databricks.com/en/mlflow/end-to-end-example.html)

- Feature tables are computed in a distributed fashion via the FE Client, but to actually use them for training they are converted with `.toPandas()` (https://docs.databricks.com/en/_extras/notebooks/source/machine-learning/feature-store-with-uc-taxi-...)

 

Overall questions:

1. Am I fundamentally misunderstanding something about ML and Databricks?

2. In the case where we do use scikit-learn and single-node compute and don't really need all the distributed compute, what is the best driver/worker setup?

0 REPLIES 0