Re: Tuning `CrossValidator` spark job performance

cchalc · ‎05-11-2022

Hello @Assaad Mrad,

This looks like a choice between putting the pipeline inside the cross validator and putting the cross validator inside the pipeline. Since you are doing the polynomial expansion as part of the pipeline, you might want to consider putting the CV in the pipeline: the feature stages then run once on the training data instead of being rerun for every fold and parameter combination.

So something like:

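from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator

# Tune only the GBT estimator; the feature stages stay outside the CV loop.
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid,
                    numFolds=3, parallelism=3, seed=42)

# Feature stages run once; the CV stage is the last stage and sees their output.
stagesWithCV = [assembler, px, standardScalar, cv]
pipeline = Pipeline(stages=stagesWithCV)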
 
pipelineModel = pipeline.fit(trainDF)
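
For a rough sense of the saving, here is a back-of-the-envelope count in plain Python (not Spark code). It assumes that with the pipeline inside the CV, every fold and every parameter combination triggers a full pipeline fit, plus one final refit of the best model on all the training data. The function name and the grid size of 6 are made up for illustration.

```python
def feature_stage_runs(num_folds, num_param_maps, pipeline_inside_cv):
    """Rough count of how often the feature stages are run on training data."""
    if pipeline_inside_cv:
        # Assumption: one full pipeline fit per (fold, param map),
        # plus one final refit of the best model on the full training set.
        return num_folds * num_param_maps + 1
    # CV inside the pipeline: feature stages run once, before the CV stage.
    return 1

num_folds = 3        # matches numFolds=3 above
num_param_maps = 6   # hypothetical size of paramGrid

runs_pipeline_in_cv = feature_stage_runs(num_folds, num_param_maps, True)
runs_cv_in_pipeline = feature_stage_runs(num_folds, num_param_maps, False)
print(runs_pipeline_in_cv, runs_cv_in_pipeline)
```

The gap grows with the size of the grid, which is why an expensive stage like polynomial expansion makes the difference noticeable.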

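The safest approach is to put the whole pipeline inside the CV: fitted stages such as the scaler are then refit on each training fold and never see the validation fold, which prevents data leakage. If leakage is not a concern for your use case, putting the CV in the pipeline can give you some performance improvement.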
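
To make the leakage point concrete, here is a toy sketch in plain Python. It is a hand-rolled stand-in for a scaler, not Spark's StandardScaler, and the fold values are invented. It shows that scaling statistics fitted on all the rows differ from those fitted on the training fold alone, so the validation rows influence their own preprocessing.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Toy scaler fit: return (mean, population std) of the values."""
    return mean(values), pstdev(values)

def scale(values, params):
    """Standardize values using previously fitted (mean, std)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Invented data for one CV split.
train_fold = [1.0, 2.0, 3.0, 4.0]
validation_fold = [10.0, 12.0]

# CV inside the pipeline: the scaler was fit before the split, on all rows.
leaky_params = fit_scaler(train_fold + validation_fold)

# Pipeline inside the CV: the scaler is fit on the training fold only.
clean_params = fit_scaler(train_fold)

leaky_scaled = scale(validation_fold, leaky_params)
clean_scaled = scale(validation_fold, clean_params)
print(leaky_params[0], clean_params[0])
```

Whether this matters in practice depends on the stage: stateless transforms such as polynomial expansion learn nothing from the data, while a scaler does.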
