Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark DataFrame Checkpoint

NikosLoutas
New Contributor

Good morning,

I am having difficulty trying to checkpoint a PySpark DataFrame.

The DataFrame is not involved in a DLT pipeline, so I am using df.checkpoint(eager=True) to truncate the logical plan of df and materialize it as files within a Unity Catalog volume directory.
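Roughly what I am running (the catalog, schema, and volume names are placeholders):

spark.sparkContext.setCheckpointDir("/Volumes/my_catalog/my_schema/my_volume/checkpoints")
df = df.checkpoint(eager=True)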

However, after some searching, it seems that the checkpoint location needs to be an HDFS-mounted directory.
I think this is deprecated in Unity Catalog, and an alternative would be to write the df to the UC volume directory and then immediately read it back.

Does anyone know if HDFS is indeed deprecated in Unity Catalog, and if the alternative mentioned above is a valid one?

Thank you.

3 REPLIES

saurabh18cs
Valued Contributor III

How about mounting cloud storage?

spark.conf.set("spark.sql.streaming.checkpointLocation", "dbfs:/mnt/your-checkpoint-directory")
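For a plain df.checkpoint() on a batch DataFrame (rather than a streaming query), the checkpoint directory is set on the SparkContext instead; a minimal sketch, with the mount path only as an example:

spark.sparkContext.setCheckpointDir("dbfs:/mnt/your-checkpoint-directory")
df = df.checkpoint(eager=True)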

NikosLoutas
New Contributor

Hello, yes this could help, although I would like to avoid mounting.

saurabh18cs
Valued Contributor III

Your volume approach is also a good idea.
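For completeness, a minimal sketch of that write-then-read-back pattern, assuming a volume path that is only a placeholder:

checkpoint_path = "/Volumes/my_catalog/my_schema/my_volume/df_checkpoint"
# write the current result out to break the long lineage / logical plan
df.write.mode("overwrite").parquet(checkpoint_path)
# read it back so the new df starts from the materialized files
df = spark.read.parquet(checkpoint_path)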