Data Engineering

Forum Posts

by User16826992666 • Valued Contributor

06-17-2021 8:02:38 AM

3837 Views
1 reply
0 kudos

Resolved! What's the difference between SparkML and Spark MLlib?

I have heard people talk about SparkML, but the documentation talks about MLlib. I don't understand the difference. Could anyone help me understand this?

Latest Reply

sean_owen
Databricks Employee

06-17-2021 11:23:47 AM

0 kudos

They're not really different. Before DataFrames in Spark, older implementations of ML algorithms were built on the RDD API. This is generally called "Spark MLlib". After DataFrames, some newer implementations were added as wrappers on top of the old ones ...
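To make the distinction concrete, here is a minimal sketch of the same kind of model trained through each API in PySpark, where the RDD-based code lives in the `pyspark.mllib` package and the DataFrame-based code lives in `pyspark.ml`. The function names, the `API_PACKAGES` mapping, and the two-row toy dataset are made up for illustration, and an existing `SparkSession` passed in as `spark` is assumed.

```python
# Hypothetical lookup of which Python package backs each API style.
API_PACKAGES = {
    "rdd": "pyspark.mllib",      # older, RDD-based ("Spark MLlib")
    "dataframe": "pyspark.ml",   # newer, DataFrame-based ("Spark ML")
}


def fit_with_rdd_api(spark):
    """Train logistic regression with the RDD-based API."""
    # Imports are local so this file can be read without Spark installed.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    # The RDD API expects an RDD of LabeledPoint(label, features).
    rdd = spark.sparkContext.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ])
    return LogisticRegressionWithLBFGS.train(rdd, iterations=10)


def fit_with_dataframe_api(spark):
    """Train logistic regression with the DataFrame-based API."""
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # The DataFrame API expects "label" and "features" columns by default.
    df = spark.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.0])), (1.0, Vectors.dense([1.0, 0.0]))],
        ["label", "features"],
    )
    return LogisticRegression(maxIter=10).fit(df)
```

Both functions fit the same kind of model; what differs is the input type (an RDD of `LabeledPoint` versus a DataFrame with named columns) and the package the estimator comes from.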

by Joseph_B • Databricks Employee

06-09-2021 5:51:24 PM

1020 Views
1 reply
0 kudos

When doing hyperparameter tuning with Hyperopt, when should I use SparkTrials? Does it work with both single-machine ML (like sklearn) and distributed ML (like Apache Spark ML)?

I want to know how to use Hyperopt in different situations:
- Tuning a single-machine algorithm from scikit-learn or single-node TensorFlow
- Tuning a distributed algorithm from Spark ML or distributed TensorFlow / Horovod

Latest Reply

Joseph_B
Databricks Employee

06-09-2021 5:56:20 PM

0 kudos

The right question to ask is indeed: Is the algorithm you want to tune single-machine or distributed? If it's a single-machine algorithm like any from scikit-learn, then you can use SparkTrials with Hyperopt to distribute hyperparameter tuning. If it's...
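The single-machine case described above might look roughly like the sketch below. It assumes Hyperopt is installed and an active Spark cluster is available. The `objective` function is a toy quadratic standing in for training and scoring a scikit-learn model, and the search space, `parallelism`, and `max_evals` values are illustrative choices, not recommendations. The distributed case is left out because the reply is cut off before it.

```python
def objective(params):
    """Hypothetical single-machine objective; returns a loss to minimize.

    In practice this would train and score a scikit-learn model. A toy
    quadratic is used here so the sketch stays self-contained.
    """
    return (params["x"] - 3.0) ** 2


def tune_single_machine(max_evals=32, parallelism=4):
    """Distribute trials of a single-machine objective across Spark workers."""
    # Local import: needs hyperopt plus a running Spark session.
    from hyperopt import SparkTrials, fmin, hp, tpe

    # Illustrative search space with one continuous hyperparameter.
    space = {"x": hp.uniform("x", -10.0, 10.0)}

    # SparkTrials runs each trial as a Spark task on a worker.
    trials = SparkTrials(parallelism=parallelism)

    return fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
    )
```

Calling `tune_single_machine()` on a cluster returns the best hyperparameters found. Each individual trial still runs on a single worker, which is why this pattern suits single-machine algorithms.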

by z160896 • New Contributor II

08-06-2018 8:37:52 AM

8496 Views
2 replies
0 kudos

Why is Spark very slow with a large number of DataFrame columns?

Scala Spark app: I have a dataset of 130x14000. I read it from a Parquet file with SparkSession, then use it for a Spark ML Random Forest model (using a pipeline). It takes 7 hours to complete! Reading the Parquet file takes about 1 minute. If I implemen...

Latest Reply

EliasHaydar
New Contributor II

08-13-2018 5:11:26 AM

0 kudos

I've already answered a similar question on StackOverflow, so I'll repeat what I said there. The following may not solve your problem completely, but it should give you some pointers to start. The first problem that you are facing is the disproportio...

Databricks Community
