02-02-2023 01:39 AM
I am facing the error below while fitting my model. I am trying to run cross-validation with a pipeline inside it. Below is the code snippet for the data transformation stages:
from pyspark.ml.feature import (QuantileDiscretizer, StandardScaler, StringIndexer,
                                OneHotEncoder, VectorAssembler)

# Discretize the "time" column into 4 buckets
qd = QuantileDiscretizer(relativeError=0.01, handleInvalid="error", numBuckets=4,
                         inputCols=["time"], outputCols=["time_qd"])

# Normalize the assembled feature vector
scaler = StandardScaler() \
    .setInputCol("vectorized_features") \
    .setOutputCol("features")

# Encoder for VesselTypeGroupName
encoder = StringIndexer(handleInvalid="skip") \
    .setInputCols(["type"]) \
    .setOutputCols(["type_enc"])

# One-hot encode the categorical variables
encoder1 = OneHotEncoder() \
    .setInputCols(["type_enc", "ID1", "ID12", "time_qd"]) \
    .setOutputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"])

# Assemble the encoded columns into a single feature vector
assembler = VectorAssembler(handleInvalid="keep") \
    .setInputCols(["type_enc1", "ID1_enc", "ID12_enc", "time_qd_enc"]) \
    .setOutputCol("vectorized_features")
The total number of features after one-hot encoding will not exceed 200. The model code is below:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LogisticRegression(featuresCol="features", labelCol="label",
                        weightCol="classWeightCol")

pipeline_stages = Pipeline(stages=[qd, encoder, encoder1, assembler, scaler, lr])

# Create the Logistic Regression parameter grid for hyperparameter tuning
paramGrid_lr = (ParamGridBuilder()
                .addGrid(lr.regParam, [0.01, 0.5, 2.0])        # regularization parameter
                .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # elastic net parameter (ridge = 0)
                .addGrid(lr.maxIter, [1, 10, 20])              # number of iterations
                .build())

cv_lr = CrossValidator(estimator=pipeline_stages, estimatorParamMaps=paramGrid_lr,
                       evaluator=BinaryClassificationEvaluator(), numFolds=5, seed=42)
cv_lr_model = cv_lr.fit(train_df)
The .fit() method throws the below error:
I have tried increasing the driver memory (28 GB RAM with 8 cores) but am still facing the same error. Please help me understand the cause of this issue.
04-09-2023 08:26 AM
@Vishnu P:
The error you are seeing is likely due to running out of memory during the model training process. One possible solution is to reduce the number of features in your dataset by removing features that are unimportant or have low variance. You could also try increasing the number of partitions in your DataFrame using the repartition() method, which spreads the data over more (and smaller) tasks and reduces the memory needed per task, as sketched below.
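For example, a minimal sketch of repartitioning before fitting (the partition count of 200 is purely illustrative, not a recommendation; choose it based on your data volume and the number of cores in the cluster):

# Spread the training data over more partitions before fitting.
# 200 is an illustrative value; tune it to your data size and cluster.
train_df = train_df.repartition(200)
cv_lr_model = cv_lr.fit(train_df)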
Another thing to consider is how the training data is batched into blocks during the fit() operation. If you are on Spark 3.1 or later, LogisticRegression exposes a maxBlockSizeInMB parameter that controls how much data is stacked into a block for training; the default of 0.0 lets Spark choose a value automatically, and setting it explicitly can change the memory profile of training, depending on the size of your dataset and available resources.
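A rough sketch, assuming Spark 3.1+ (the 1 MB value below is illustrative only):

# Assumes Spark 3.1+, where LogisticRegression exposes maxBlockSizeInMB.
# The default of 0.0 lets Spark pick a block size automatically; 1.0 MB
# here is purely illustrative.
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        weightCol="classWeightCol")
lr.setMaxBlockSizeInMB(1.0)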
Finally, you could try using a distributed computing platform like Databricks Runtime, which can automatically manage memory and resources across a cluster of worker nodes. Databricks also offers features like autoscaling, which can automatically add or remove worker nodes based on demand.
04-09-2023 11:24 PM
Hi @Vishnu P
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
04-10-2023 11:21 AM
Hi @Vishnu P, could you please share the full stack trace? Also, how does the workers' memory utilization look while the job is running?