Spark DataFrame Checkpoint
Wednesday
Good morning,
I am having difficulty checkpointing a PySpark DataFrame.
The DataFrame is not part of a DLT pipeline, so I am using df.checkpoint(eager=True) to truncate the logical plan of df and materialize it as files in a Unity Catalog volume directory.
However, from what I have found, the checkpoint location needs to be an HDFS-compatible directory.
I think this is deprecated in Unity Catalog, and an alternative would be to write df to the UC volume directory and then immediately read it back.
Does anyone know whether HDFS is indeed deprecated in Unity Catalog, and whether the alternative mentioned above is valid?
Thank you.
Wednesday
How about mounting cloud storage? Note that for df.checkpoint() the directory is set on the SparkContext (the spark.sql.streaming.checkpointLocation conf only applies to streaming queries):
spark.sparkContext.setCheckpointDir("dbfs:/mnt/your-checkpoint-directory")
Wednesday
Hello, yes, this could help, although I would like to avoid mounting.
Wednesday
Your volume approach is also a good idea.

