Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

pyspark: Stage failure due to One hot encoding

Vish1
New Contributor II

I am facing the error below while fitting my model. I am trying to run a model with cross-validation, with a pipeline inside it. Here is the code snippet for the data transformation:

from pyspark.ml.feature import (QuantileDiscretizer, StandardScaler,
                                StringIndexer, OneHotEncoder, VectorAssembler)

# Bucketize the continuous "time" column into quartiles
qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])

# Normalize the assembled feature vector
scaler = StandardScaler() \
    .setInputCol("vectorized_features") \
    .setOutputCol("features")

# Index the categorical "type" column (VesselTypeGroupName)
encoder = StringIndexer(handleInvalid="skip") \
    .setInputCols(["type"]) \
    .setOutputCols(["type_enc"])

# One-hot encode the categorical variables
encoder1 = OneHotEncoder() \
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"]) \
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])

# Assemble all encoded features into a single vector
assembler = VectorAssembler(handleInvalid="keep") \
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"]) \
    .setOutputCol("vectorized_features")

The total number of features after one-hot encoding will not exceed 200. The model code is below:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

lr = LogisticRegression(featuresCol="features", labelCol="label",
                        weightCol="classWeightCol")

pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])

# Create the logistic regression parameter grid for tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])        # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # elastic-net mixing (0 = ridge)
                .addGrid(lr.maxIter, [1, 10, 20])              # number of iterations
                .build())

cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
                       evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)

The .fit() method throws the error below:

[error screenshot not transcribed]

I have tried increasing the driver memory (28 GB RAM with 8 cores) but still hit the same error. Can you please help me identify the cause of this issue?

3 REPLIES

Anonymous
Not applicable

@Vishnu P:

The error you are seeing is likely due to running out of memory during the model training process. One possible solution is to reduce the number of features in your dataset by removing features that are not important or have low variance. You could also try increasing the number of partitions in your DataFrame using the repartition() method to distribute the data across more worker nodes and reduce memory usage per node.

Another thing to consider is the size of the tuning workload itself. Note that pyspark.ml's LogisticRegression does not expose a batchSize parameter; mini-batch sizing (miniBatchFraction) only exists in the legacy spark.mllib SGD-based API, so that knob does not apply here. What you can control is how much work the CrossValidator does: your grid has 27 parameter combinations, and with 5 folds that means well over a hundred model fits. Shrinking the grid or the fold count will reduce both runtime and peak memory.
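Whatever batching applies, it is worth tallying how many models this cross-validation run actually fits. This is plain arithmetic from the grid and fold counts in the post:

```python
# Grid from the post: 3 regParam x 3 elasticNetParam x 3 maxIter values
grid_size = 3 * 3 * 3   # 27 parameter combinations
num_folds = 5           # numFolds=5 in the CrossValidator

# CrossValidator fits one model per (combination, fold) pair,
# then refits the best combination once on the full training data.
total_fits = grid_size * num_folds
print(grid_size, total_fits)  # 27 135
```

Cutting maxIter from the grid alone (it is usually better set to a single generous value) drops this from 135 fits to 45.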

Finally, you could try using a distributed computing platform like Databricks Runtime, which can automatically manage memory and resources across a cluster of worker nodes. Databricks also offers features like autoscaling, which can automatically add or remove worker nodes based on demand.

Anonymous
Not applicable

Hi @Vishnu P,

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

shyam_9
Valued Contributor

Hi @Vishnu P, could you please share the full stack trace? Also, could you check how the workers' memory is being utilized?
