
In Spark MLlib, what is the difference between an estimator and a transformer?

User16826992666
Valued Contributor
 

Accepted Solutions

sean_owen
Databricks Employee

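These terms are borrowed from scikit-learn, and the idea is the same. A transformer is just a component of a pipeline that transforms the data in some way: its transform() method takes a DataFrame and returns a new one. An estimator is a component that needs to be 'fit' on data before it knows how to transform. In Spark MLlib, calling fit() on an estimator returns a model, and that fitted model is itself a transformer.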

For example, a Tokenizer is just a transformer, because it does not need to see any data to know how to tokenize strings. A machine learning algorithm like LogisticRegression also ends up transforming data, by adding a prediction column, but it has to be fit on data before it can do so. So LogisticRegression is an estimator, and the LogisticRegressionModel that fit() returns is the transformer.
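
To make the fit/transform split concrete, here is a minimal plain-Python sketch of the pattern. It is a toy, not Spark's implementation: the classes below are hypothetical stand-ins that operate on lists of dicts rather than DataFrames, and MajorityLabelClassifier is an invented, deliberately trivial "algorithm". In PySpark the corresponding real classes would be pyspark.ml.feature.Tokenizer and pyspark.ml.classification.LogisticRegression.

```python
from collections import Counter


class Tokenizer:
    """Toy transformer: it needs no fitting, so transform() works immediately."""

    def transform(self, rows):
        # Return new rows with an added "words" field; input rows are not mutated.
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class MajorityLabelModel:
    """Toy fitted model: a transformer that carries state learned during fit()."""

    def __init__(self, majority_label):
        self.majority_label = majority_label

    def transform(self, rows):
        # Add a "prediction" field using what was learned from the training data.
        return [{**row, "prediction": self.majority_label} for row in rows]


class MajorityLabelClassifier:
    """Toy estimator: it has fit() but no transform(); fit() returns a model."""

    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        majority_label, _ = counts.most_common(1)[0]
        return MajorityLabelModel(majority_label)


# Hypothetical training data, made up for this illustration.
train = [
    {"text": "Spark is fast", "label": 1},
    {"text": "MLlib has pipelines", "label": 1},
    {"text": "slow job", "label": 0},
]

tokenized = Tokenizer().transform(train)           # transformer: no fit needed
model = MajorityLabelClassifier().fit(tokenized)   # estimator: fit() -> model
predictions = model.transform(tokenized)           # the model is a transformer
```

The same three-step shape (transform, fit, then transform with the fitted model) is what a Pipeline automates when it chains stages together.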
