
Setting checkpoint directory for checkpointInterval argument of estimators in pyspark.ml

Fed
New Contributor III

Tree-based estimators in pyspark.ml have an argument called checkpointInterval:

checkpointInterval = Param(parent='undefined', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.')

When I run sc.getCheckpointDir() I get None, so I assume the setting will be ignored, as stated in the doc. Is that a wrong assumption to make?

I've tried to set one with sc.setCheckpointDir("dbfs:/path/to/checkpoint"). After fitting an estimator I checked for files with dbutils.fs.ls(sc.getCheckpointDir()), but the directory was empty.

The doc of pyspark.SparkContext.setCheckpointDir says that "The directory must be an HDFS path if running on a cluster." But am I right that a DBFS path should work too?

Is there a way to check if the estimator is indeed checkpointing at fitting time?

1 REPLY

Anonymous
Not applicable

@Federico Trifoglio:

If sc.getCheckpointDir() returns None, no checkpoint directory is set in the SparkContext, and in that case the checkpointInterval argument will indeed be ignored. To set a checkpoint directory, use the SparkContext.setCheckpointDir() method, as you have already done. The directory must be an HDFS path if running on a cluster; however, Databricks DBFS is an HDFS-compatible file system, so you should be able to use a DBFS path as well.
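As a minimal sketch of setting and then confirming the checkpoint directory on Databricks (the dbfs:/tmp/checkpoints path below is just an illustrative choice, not a required location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Point checkpoints at a DBFS location (illustrative path)
sc.setCheckpointDir("dbfs:/tmp/checkpoints")

# Should now print the resolved checkpoint path instead of None
print(sc.getCheckpointDir())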

Regarding your question about checking whether the estimator is checkpointing at fitting time, you can use the explainParams() method to display the current parameter settings of the estimator, including the checkpointInterval value. For example:

from pyspark.ml.classification import RandomForestClassifier

# Enable checkpointing of the cached intermediate data every 10 iterations
rf = RandomForestClassifier(checkpointInterval=10)

# Prints every param with its documentation and current value
print(rf.explainParams())

This will output a list of all the parameters of the RandomForestClassifier, including checkpointInterval. If the value is set to a positive integer, such as 10 in this example, checkpointing is enabled and will occur every 10 iterations, provided a checkpoint directory is set. You can also call sc.getCheckpointDir() to confirm that the checkpoint directory has been properly set, and inspect its contents after fitting to see whether checkpoint files were actually written.
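A minimal sketch of that last check, assuming a Databricks notebook where sc and dbutils are predefined and the checkpoint directory was set before calling fit():

# After fitting the model, inspect the checkpoint directory
ckpt_dir = sc.getCheckpointDir()
if ckpt_dir is None:
    print("No checkpoint directory set -- checkpointInterval is being ignored")
else:
    entries = dbutils.fs.ls(ckpt_dir)
    print(f"{len(entries)} entries under {ckpt_dir}")

If the directory stays empty, it may simply mean the fit finished in fewer iterations than checkpointInterval, so no checkpoint was ever triggered.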
