Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

pyspark: Stage failure due to One hot encoding

New Contributor II

I am facing the error below while fitting my model. I am running cross-validation over a pipeline. Here is the code snippet for the data transformation:

from pyspark.ml.feature import (QuantileDiscretizer, StandardScaler,
                                StringIndexer, OneHotEncoder, VectorAssembler)

# Discretize "time" into 4 quantile buckets
qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])

# Normalize the assembled feature vector
scaler = StandardScaler() \
    .setInputCol("vectorized_features") \
    .setOutputCol("features")

# Index the categorical "type" column
encoder = StringIndexer(handleInvalid="skip") \
    .setInputCols(["type"]) \
    .setOutputCols(["type_enc"])

# One-hot encode the categorical variables
encoder1 = OneHotEncoder() \
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"]) \
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])

# Assemble the variables into a single feature vector
assembler = VectorAssembler(handleInvalid="keep") \
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"]) \
    .setOutputCol("vectorized_features")

The total number of features after one hot encoding will not exceed 200. The model code is below:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])

# Create Logistic Regression parameter grid for parameter tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])        # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # elastic-net mixing (0 = ridge)
                .addGrid(lr.maxIter, [1, 10, 20])              # number of iterations
                .build())

cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
                       evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)  # train_df: the training DataFrame

The .fit() call throws the error below:


I have tried increasing the driver memory (28 GB RAM with 8 cores) but am still facing the same error. Please help me understand the cause of this issue.


Not applicable

@Vishnu P​ :

The error you are seeing is likely due to running out of memory during model training. One possible solution is to reduce the number of features in your dataset by removing features that are unimportant or have low variance. You could also try increasing the number of partitions in your DataFrame with the repartition() method, which distributes the data across more tasks and reduces memory usage per task.
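A minimal sketch of the repartition-and-cache idea (the local SparkSession and the toy DataFrame here are placeholders; substitute your own training data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("repartition-sketch").getOrCreate()

# Placeholder data standing in for the real training DataFrame
df = spark.range(0, 10000).withColumnRenamed("id", "time")

# Spread rows over more partitions so each task holds less data,
# then cache so repeated cross-validation folds reuse the materialized result
df = df.repartition(16).cache()
df.count()  # action that materializes the cache

print(df.rdd.getNumPartitions())

Caching before fit() matters here because CrossValidator re-reads the data once per fold and per parameter combination, so an uncached lineage gets recomputed many times.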

Another thing to consider is how the cross-validation itself runs. CrossValidator's parallelism parameter controls how many candidate models are trained concurrently; higher values speed up tuning but multiply peak memory usage, since several models and their intermediate data live in memory at once. Keeping parallelism low, and caching the training DataFrame before calling fit() so each fold does not recompute the full transformation lineage, can both reduce memory pressure.
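For reference, a minimal sketch of setting CrossValidator's parallelism (a toy estimator and grid stand in for the original pipeline; parameter values are illustrative only):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.master("local[2]").appName("cv-parallelism").getOrCreate()

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# parallelism=1 trains the candidate models one at a time, which keeps
# peak memory low at the cost of wall-clock tuning time
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=1,
                    seed=42)

print(cv.getParallelism())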

Finally, you could try using a distributed computing platform like Databricks Runtime, which can automatically manage memory and resources across a cluster of worker nodes. Databricks also offers features like autoscaling, which can automatically add or remove worker nodes based on demand.

Not applicable

Hi @Vishnu P​ 

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.


Valued Contributor

Hi @Vishnu P​, could you please share the full stack trace? Also, how is the workers' memory utilization looking?
