
Pyspark. How to get best params in grid search

pmezentsev
New Contributor

Hello!

I am using Spark 2.1.1 with Python

(Python 2.7, run in a Jupyter notebook)

and am trying to run a grid search over linear regression parameters.

My code looks like this:

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# sql_transformer, assembler and lr are the pipeline stages (definitions not shown)
pipeline = Pipeline(stages=[
                sql_transformer,
                assembler,
                lr])
paramGrid = ParamGridBuilder().addGrid(lr.solver, ["l-bfgs", "normal"]).build()
evaluator = RegressionEvaluator()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(train)
cvModel.avgMetrics
out[] > [887.3183210064692, 787.3183297841774]

My question is: how can I find out which set of params each metric corresponds to?

How can I get the params of the best trained model?

7 REPLIES

Joseph_B
Databricks Employee

To match the metrics with the sets of params:

'paramGrid' is a list of Param maps; 'avgMetrics' is a list of metrics. These two lists have the same order, so you can just zip them together (the list() call is only needed on Python 3, where zip returns an iterator):

list(zip(cvModel.avgMetrics, paramGrid))
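As a rough sketch of the pairing described above, the helper below zips the metrics with the param maps and sorts them so the best combination comes first. The function name `rank_param_maps` and the demo data are hypothetical: the demo uses plain dicts as stand-ins, whereas a real `paramGrid` holds dicts keyed by `Param` objects. The `larger_is_better` flag is assumed to come from `evaluator.isLargerBetter()`, which is False for the default RMSE metric of `RegressionEvaluator`.

```python
def rank_param_maps(avg_metrics, param_maps, larger_is_better=False):
    """Pair each averaged CV metric with its param map, best pair first."""
    if len(avg_metrics) != len(param_maps):
        raise ValueError("expected exactly one metric per param map")
    pairs = list(zip(avg_metrics, param_maps))
    # Sort on the metric only; the param maps themselves are not orderable.
    return sorted(pairs, key=lambda pair: pair[0], reverse=larger_is_better)


# Stand-in data shaped like the output in the question (plain dicts, not Params).
demo_metrics = [887.3183210064692, 787.3183297841774]
demo_grid = [{"solver": "l-bfgs"}, {"solver": "normal"}]

ranked = rank_param_maps(demo_metrics, demo_grid)
best_metric, best_params = ranked[0]
```

With a fitted model, the call would presumably look like `rank_param_maps(cvModel.avgMetrics, paramGrid, evaluator.isLargerBetter())`.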

To find the best set of params:

If you have a CrossValidatorModel (after fitting a CrossValidator), then you can get the best model from the field called bestModel. You can then use extractParamMap to get the best model's parameters:

bestPipeline = cvModel.bestModel
bestLRModel = bestPipeline.stages[2]
bestParams = bestLRModel.extractParamMap()

I tried the code above, but bestParams still shows a null list. Any thoughts?

I tried this code, but extractParamMap() shows some parameters and not the best values for the parameters in the paramGrid.

Joseph_B
Databricks Employee

This has been improved in Apache Spark 2.3.0 by https://issues.apache.org/jira/browse/SPARK-10931, which copies Param values into the Python wrappers around Scala types. extractParamMap() extracts all Params, so you have to look within it for the Params from the grid that you really care about.
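One way to "look within" the extracted map is to keep only the entries whose Param names appear in the grid. The sketch below assumes that matching by name is good enough, which may not hold if several pipeline stages share a Param name. `grid_params_only` and `FakeParam` are hypothetical names; `FakeParam` only mimics the `parent` and `name` attributes of `pyspark.ml.param.Param` so the example runs without Spark.

```python
from collections import namedtuple


def grid_params_only(full_param_map, param_grid):
    """Keep only the entries whose Param name was tuned in the grid."""
    tuned_names = {param.name for param_map in param_grid for param in param_map}
    return {
        param.name: value
        for param, value in full_param_map.items()
        if param.name in tuned_names
    }


# Hypothetical stand-in for pyspark.ml.param.Param (only .parent and .name).
FakeParam = namedtuple("FakeParam", ["parent", "name"])
solver = FakeParam("LinearRegression_demo", "solver")
max_iter = FakeParam("LinearRegression_demo", "maxIter")

demo_grid = [{solver: "l-bfgs"}, {solver: "normal"}]
demo_full_map = {solver: "normal", max_iter: 100}  # like extractParamMap() output

tuned = grid_params_only(demo_full_map, demo_grid)
```

With the objects from this thread, the call would presumably be `grid_params_only(bestLRModel.extractParamMap(), paramGrid)`.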

Let me give you an example. After I call bestModel, I get a pyspark.ml.recommendation.ALSModel (the fitted model). What I really want is the pyspark.ml.recommendation.ALS estimator, which is why I cannot get a parameter such as alpha from the model.

shyam_9
Databricks Employee

Hi @pmezentsev,

You can build a param grid with different values of the parameters and then get the best params using GridSearchCV (note that this is the scikit-learn API, not PySpark):

param_grid = { 'n_estimators': [200, 500, 700], 'max_features': ['auto', 'sqrt', 'log2'] }

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)

