Spark DataFrame Checkpoint

NikosLoutas — Wed, 02 Apr 2025 09:11:18 GMT

Good morning,

I am having difficulty checkpointing a PySpark DataFrame.

The DataFrame is not part of a DLT pipeline, so I am using df.checkpoint(eager=True) to truncate the logical plan of df and materialize it as files within a Unity Catalog volume directory.

However, after some searching, it seems that the checkpoint location needs to be an HDFS-mounted directory.
I think this is deprecated in Unity Catalog, and an alternative would be to write the df to the UC volume directory and then immediately read it back.
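
A minimal sketch of that write-then-read alternative, not a confirmed replacement for checkpoint(). The helper name materialize_via_volume and the volume path in the usage comment are hypothetical; spark is assumed to be the active SparkSession and Parquet is an arbitrary choice of format.

```python
def materialize_via_volume(spark, df, path):
    """Write `df` to `path` as Parquet, then read it back.

    The returned DataFrame's plan starts at the Parquet scan, so the
    lineage that produced `df` is no longer part of it.
    """
    # Overwrite so that re-running the step does not fail on an existing directory.
    df.write.mode("overwrite").parquet(path)
    # Reading back yields a new DataFrame backed by the files just written.
    return spark.read.parquet(path)


# Usage (hypothetical path, adjust catalog/schema/volume to your workspace):
# df = materialize_via_volume(
#     spark, df, "/Volumes/my_catalog/my_schema/my_volume/tmp/df_checkpoint"
# )
```

The files written this way stay in the volume until they are deleted explicitly, so some cleanup step would be needed once the DataFrame is no longer in use.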

Does anyone know whether HDFS is indeed deprecated in Unity Catalog, and whether the alternative mentioned above is a valid one?

Thank you.

Re: Spark DataFrame Checkpoint

saurabh18cs — Wed, 02 Apr 2025 11:23:32 GMT

How about mounting cloud storage?

spark.sparkContext.setCheckpointDir("dbfs:/mnt/your-checkpoint-directory")

Re: Spark DataFrame Checkpoint

saurabh18cs — Wed, 02 Apr 2025 11:24:10 GMT

Your volume approach is also a good idea.

Re: Spark DataFrame Checkpoint

NikosLoutas — Thu, 03 Apr 2025 06:31:05 GMT

Hello, yes, this could help, although I would like to avoid mounting.