Databricks

Vish1 · 02-02-2023

I am getting the error below while fitting my model. I am running cross-validation with a pipeline as the estimator. Here is the code for the data transformation stages:

from pyspark.ml.feature import (OneHotEncoder, QuantileDiscretizer,
                                StandardScaler, StringIndexer, VectorAssembler)

# Bucket "time" into 4 quantile-based bins
qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])

# Normalize the assembled feature vector
scaler = StandardScaler()\
    .setInputCol("vectorized_features")\
    .setOutputCol("features")

# Encoder for VesselTypeGroupName
encoder = StringIndexer(handleInvalid='skip')\
    .setInputCols(["type"])\
    .setOutputCols(["type_enc"])

# One-hot encode the categorical variables
encoder1 = OneHotEncoder()\
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"])\
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])

# Assemble the encoded columns into a single vector
assembler = VectorAssembler(handleInvalid="keep")\
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])\
    .setOutputCol("vectorized_features")

The total number of features after one hot encoding will not exceed 200. The model code is below:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol='features', labelCol='label',
                        weightCol='classWeightCol')
pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])

# Create the Logistic Regression parameter grid for tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])         # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])   # Elastic Net parameter (Ridge = 0)
                .addGrid(lr.maxIter, [1, 10, 20])               # number of iterations
                .build())

cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
                       evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)

The .fit method throws the error below:

I have tried increasing the driver memory (28 GB RAM with 8 cores) but I still get the same error. What is causing this issue?

Anonymous · 04-09-2023

@Vishnu P:

The error you are seeing is likely due to running out of memory during model training. One possible solution is to reduce the number of features in your dataset by removing features that are not important or have low variance. You could also try increasing the number of partitions in your DataFrame with the repartition() method, which spreads the data across more, smaller partitions and so reduces the memory needed per task.
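A minimal sketch of the repartition() suggestion. The helper names `target_partitions` and `repartition_for_training`, and the factor of 3 partitions per core, are assumptions made up for illustration; they are not Spark settings, and the right partition count depends on your cluster and data size.

```python
def target_partitions(total_worker_cores, partitions_per_core=3):
    """Pick a partition count as a small multiple of the worker cores.

    Both arguments are illustrative assumptions, not Spark parameters.
    """
    if total_worker_cores < 1 or partitions_per_core < 1:
        raise ValueError("core count and multiplier must be positive")
    return total_worker_cores * partitions_per_core


def repartition_for_training(df, total_worker_cores, partitions_per_core=3):
    """Return df spread over more partitions before calling fit().

    `df` is expected to be a pyspark.sql.DataFrame; only its
    repartition(numPartitions) method is used here.
    """
    n = target_partitions(total_worker_cores, partitions_per_core)
    return df.repartition(n)
```

With the DataFrame from the question, this would be called as `train_df = repartition_for_training(train_df, total_worker_cores=16)` before `cv_lr.fit(train_df)`, where 16 is a placeholder for the real core count.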

Another thing to consider is how much data each task processes at once during the fit() operation. Note that LogisticRegression in pyspark.ml does not have a batchSize parameter. The closest settings are maxBlockSizeInMB (Spark 3.1 and later), which caps the size of the blocks that input rows are stacked into, and aggregationDepth (default 2), which sets the depth of the tree aggregation and can be raised when the driver is the bottleneck. Which values help depends on the size of your dataset and available resources.
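A sketch of the tuning knobs that LogisticRegression in pyspark.ml does expose for this, to my knowledge: aggregationDepth and, from Spark 3.1 on, maxBlockSizeInMB. I am not aware of a batchSize parameter on this class, so it is not used here. The helper name `lr_memory_params` and the numeric values are assumptions for illustration, not recommended settings.

```python
def lr_memory_params(spark_version, aggregation_depth=3, max_block_mb=1.0):
    """Build keyword arguments for pyspark.ml.classification.LogisticRegression.

    Column names come from the question; the numeric values are
    illustrative assumptions, not tuned settings.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    params = {
        "featuresCol": "features",
        "labelCol": "label",
        "weightCol": "classWeightCol",
        # Depth of the treeAggregate used when summing gradients (must be >= 2).
        "aggregationDepth": aggregation_depth,
    }
    # maxBlockSizeInMB is only available on LogisticRegression from Spark 3.1.
    if (major, minor) >= (3, 1):
        params["maxBlockSizeInMB"] = max_block_mb
    return params
```

On a cluster this would be used as `lr = LogisticRegression(**lr_memory_params(spark.version))`, with the rest of the pipeline unchanged.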

Finally, you could try using a distributed computing platform like Databricks Runtime, which can automatically manage memory and resources across a cluster of worker nodes. Databricks also offers features like autoscaling, which can automatically add or remove worker nodes based on demand.
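When sizing resources for this job, it may also help to count how many times the pipeline is fitted. The sketch below uses the grid values and numFolds from the question; the function name `count_pipeline_fits` is made up for illustration. It assumes one fit per parameter combination per fold, plus the final refit of the best combination on the full training set.

```python
from itertools import product

# Grid values copied from the question's ParamGridBuilder.
grid = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 10, 20],
}
num_folds = 5


def count_pipeline_fits(grid, num_folds):
    """Count fits: one per (parameter combination, fold), plus one
    final refit of the best combination on the full training set."""
    combos = len(list(product(*grid.values())))
    return combos * num_folds + 1


total_fits = count_pipeline_fits(grid, num_folds)
print(total_fits)  # 27 combinations * 5 folds + 1 final refit = 136
```

Trimming the grid, for example by dropping maxIter=1, lowers this count directly and is often a cheaper first step than adding hardware.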
