Why is spark mllib is not encouraged on the platform?/Why is ML dependent on .toPandas() on dbricks?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-26-2024 12:30 PM
I'm new to Spark,Databricks and am surprised about how the Databricks tutorials for ML are using pandas DF > Spark DF. Of the tutorials I've seen, most data processing is done in a distributed manner but then its just cast to a pandas dataframe. From my perspective, I was excited to use Dbricks for faster processing and training but it just feels like I'm trading off the time I gain preprocessing to eventually use `.toPandas()` somewhere
Tutorials:
- (https://docs.databricks.com/en/mlflow/end-to-end-example.html)
- Feature tables are computed in a distributed format via the FE Client, but to actually utilize them for training they are cast .toPandas() (https://docs.databricks.com/en/_extras/notebooks/source/machine-learning/feature-store-with-uc-taxi-...)
Overall questions:
1. Am I fundamentally misunderstanding something about ML and Databricks?
2. In the case where we do use scikit-learn and single-node compute and don't really need all the distributed compute, what is the best driver/worker setup?

