
Setting checkpoint directory for checkpointInterval argument of estimators in pyspark.ml

Fed
New Contributor III

Tree-based estimators in pyspark.ml have an argument called checkpointInterval:

checkpointInterval = Param(parent='undefined', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext.')

When I run sc.getCheckpointDir() I get None, so I assume the setting will be ignored, as stated in the doc. Is that a wrong assumption to make?

I've tried to set one with sc.setCheckpointDir("dbfs:/path/to/checkpoint"). After fitting an estimator I checked for files with dbutils.fs.ls(sc.getCheckpointDir()), but the directory was empty.

The doc of pyspark.SparkContext.setCheckpointDir says that "The directory must be an HDFS path if running on a cluster." But am I right that a DBFS path should work too?

Is there a way to check if the estimator is indeed checkpointing at fitting time?

1 REPLY

Anonymous
Not applicable

@Federico Trifoglio:

If sc.getCheckpointDir() returns None, no checkpoint directory is set in the SparkContext, and in that case the checkpointInterval argument will indeed be ignored. To set a checkpoint directory, use the SparkContext.setCheckpointDir() method, as you have already done. The directory must be an HDFS path if running on a cluster; however, Databricks DBFS is an HDFS-compatible file system, so you should be able to use a DBFS path as well.
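As a minimal sketch of setting and then confirming the checkpoint directory on Databricks (the dbfs:/tmp/checkpoints path below is just an illustrative choice, not a required location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Point checkpoints at a DBFS location (illustrative path)
sc.setCheckpointDir("dbfs:/tmp/checkpoints")

# Should now print the resolved checkpoint path instead of None
print(sc.getCheckpointDir())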

Regarding your question about checking whether the estimator is checkpointing at fitting time, you can use the explainParams() method to display the current parameter settings of the estimator, including the checkpointInterval value. For example:

from pyspark.ml.classification import RandomForestClassifier

# Enable checkpointing of the cached intermediate data every 10 iterations
rf = RandomForestClassifier(checkpointInterval=10)

# Prints every param with its documentation and current value
print(rf.explainParams())

This will output a list of all the parameters of the RandomForestClassifier, including checkpointInterval. If the value is set to a positive integer, such as 10 in this example, checkpointing is enabled and will occur every 10 iterations, provided a checkpoint directory is set. You can also call sc.getCheckpointDir() to confirm that the checkpoint directory has been properly set, and inspect its contents after fitting to see whether checkpoint files were actually written.
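A minimal sketch of that last check, assuming a Databricks notebook where sc and dbutils are predefined and the checkpoint directory was set before calling fit():

# After fitting the model, inspect the checkpoint directory
ckpt_dir = sc.getCheckpointDir()
if ckpt_dir is None:
    print("No checkpoint directory set -- checkpointInterval is being ignored")
else:
    entries = dbutils.fs.ls(ckpt_dir)
    print(f"{len(entries)} entries under {ckpt_dir}")

If the directory stays empty, it may simply mean the fit finished in fewer iterations than checkpointInterval, so no checkpoint was ever triggered.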
