You are noticing a common pattern in Databricks ML tutorials: data is often processed with Spark for scalability, but training and modeling are frequently done on pandas DataFrames using single-node libraries like scikit-learn. This workflow can be confusing for users expecting end-to-end distributed compute. Here’s why this happens and how to think about your setup:
Why Databricks Tutorials Use .toPandas()
- Library Support: Most Python ML libraries (scikit-learn, XGBoost, LightGBM) expect a pandas DataFrame and run on a single machine. Spark MLlib is distributed, but does not match the breadth and maturity of the pandas-based ecosystem, particularly for custom modeling or advanced workflows.
- Workflow Scalability: Spark excels at distributed data preparation and feature engineering when datasets exceed RAM or local disk. Once the data is prepared, it is often small enough to fit in memory, so tutorials favor converting to pandas for flexibility and speed in model training.
- Interoperability: Converting to pandas lets you leverage rich visualization, analysis, and ML libraries that aren't Spark-native. For most projects, the distributed-compute bottleneck is in ETL, not in fitting models, unless you are doing deep learning or working with truly huge datasets. A minimal sketch of this prepare-then-convert pattern follows below.
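For illustration, here is a minimal sketch of the prepare-in-Spark, train-in-pandas pattern, assuming a Databricks notebook where spark is already defined; the table and column names (sales_features, feature_1, feature_2, label) are hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Distributed feature engineering in Spark (table/column names are placeholders)
features_sdf = (
    spark.table("sales_features")
         .filter("label IS NOT NULL")
         .select("feature_1", "feature_2", "label")
)

# The prepared result is now small, so collect it onto the driver as pandas
features_pdf = features_sdf.toPandas()

# Single-node training with scikit-learn on the driver
X = features_pdf[["feature_1", "feature_2"]]
y = features_pdf["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```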
Do You Need Spark for ML Training?
- If your full training dataset fits comfortably in RAM and your model trains in minutes on one machine, distributed training may be overkill.
- For larger problems, such as deep learning or massive tabular datasets, Databricks supports distributed ML frameworks (MLlib, HorovodRunner, distributed XGBoost) that avoid the .toPandas() step, but these often require different APIs and more configuration; see the sketch after this list.
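As a rough sketch of what skipping .toPandas() can look like, here is distributed XGBoost training through the xgboost.spark integration (bundled with recent Databricks ML runtimes, otherwise requiring xgboost >= 1.7); the table, column names, and worker count are placeholders, not a definitive recipe:

```python
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # assumes xgboost >= 1.7 on the cluster

# Assemble hypothetical feature columns into a single vector column
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
train_sdf = assembler.transform(spark.table("sales_features"))

# Training runs across the Spark workers; the data never has to fit on the driver
xgb = SparkXGBClassifier(features_col="features", label_col="label", num_workers=4)
xgb_model = xgb.fit(train_sdf)
predictions = xgb_model.transform(train_sdf)
```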
Designing the Driver/Worker Setup for Single-node ML
If you use scikit-learn or similar libraries and do most work locally, the best Databricks cluster setup is:
- Right-sized driver node: Choose a driver with enough RAM and CPU for the final pandas DataFrame and model training. Often a Standard_DS3_v2 (4 vCPUs, 14 GB RAM) or Standard_D4_v2 (8 vCPUs, 28 GB RAM) instance is enough if your data and model fit comfortably in that memory.
- Minimal workers: Set the worker count to zero or one unless you need distributed Spark for earlier data prep. If only single-node ML is needed post-ETL, keep costs low by disabling autoscaling and minimizing workers.
- Configurability: You can adjust cluster size dynamically in the UI or with code. Transition from distributed Spark for ETL to single-node processing by resizing the cluster, or by running single-node jobs on the driver only; a hedged cluster-spec example follows this list.
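As one hedged example, a driver-only cluster can be described with a spec like the one below and created through the Clusters REST API (POST /api/2.0/clusters/create), the Databricks SDK, or equivalently in the UI; the cluster name, node type, and runtime version are illustrative assumptions:

```python
# Sketch of a single-node ("driver only") cluster spec for the Databricks Clusters API.
# Values such as spark_version and node_type_id are assumptions; adjust to your workspace.
single_node_cluster_spec = {
    "cluster_name": "single-node-ml",
    "spark_version": "14.3.x-cpu-ml-scala2.12",   # assumed ML runtime version
    "node_type_id": "Standard_DS3_v2",            # driver sized for your pandas data
    "num_workers": 0,                              # no workers: everything runs on the driver
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```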
When to Use Spark MLlib Instead
- Use MLlib for truly distributed modeling, clustering, or regression on datasets that are too large to fit in driver RAM.
- Accept the tradeoff: MLlib does not natively support every scikit-learn feature, but it excels at scale; a short pipeline sketch follows below.
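For comparison, here is a minimal MLlib pipeline sketch in which training stays distributed; table and column names are again hypothetical:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assemble hypothetical feature columns and fit a distributed logistic regression
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(spark.table("sales_features"))
predictions = model.transform(spark.table("sales_features"))
```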
If you want end-to-end distributed ML on Databricks, consider MLlib or other distributed frameworks; for flexible model prototyping, the pandas conversion remains common, even if it feels redundant after distributed ETL. Your intuition is correct: the major time savings on Databricks usually come from scalable preprocessing rather than from training, unless you adopt distributed ML APIs.