Databricks Community

alejandrofm · ‎03-31-2022

Hi, I'm doing something simple in a Databricks notebook:

spark.sparkContext.setCheckpointDir("/tmp/")

import pyspark.pandas as ps

sql = """select
field1, field2
From table
Where date >= '2021-01-01'"""

df = ps.sql(sql)
df.spark.checkpoint()

That runs great and saves the RDD under /tmp/. Then I want to save the df with

df.to_csv('/FileStore/tables/test.csv', index=False)

or

df1.spark.coalesce(1).to_csv('/FileStore/tables/test.csv', index=False)

And it runs the query again (once for the checkpoint and then a second time to save the file).

What am I doing wrong? Currently, to work around this, I'm saving the first dataframe without a checkpoint, opening it again, and saving it with coalesce.

If I use the coalesce(1) directly it doesn't parallelize.

EDIT:

Tried

df.spark.cache()

But it still reprocesses when I try to save to CSV. I'm looking to avoid reprocessing and to avoid saving twice.

The question is: why does it recalculate df1 after the checkpoint?

Thanks!

Hubert-Dudek · ‎03-31-2022

Please use localCheckpoint(True) so it will be stored on executors and trigger immediately.
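
On a pandas-on-Spark frame, the corresponding call sits under the `spark` accessor as `local_checkpoint`. The following is a minimal sketch, not a confirmed fix for the problem in this thread; the helper name `local_checkpoint_frame` is hypothetical, and it assumes the returned frame (not the original one) is what carries the truncated lineage.

```python
def local_checkpoint_frame(psdf):
    """Locally checkpoint a pandas-on-Spark DataFrame and return the result.

    Assumption: psdf is a pyspark.pandas DataFrame. A local checkpoint is
    kept on the executors, so no checkpoint directory has to be set first.
    eager=True asks Spark to materialize it immediately.
    """
    # The call returns a new frame; psdf itself keeps its original plan.
    return psdf.spark.local_checkpoint(eager=True)
```

In a notebook this would be used as `df = local_checkpoint_frame(ps.sql(sql))`, with later actions run on the rebound `df`.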

alejandrofm · ‎03-31-2022

@Hubert Dudek, no luck with that. How do you use it on a ps dataframe?

Why do you think it doesn't work saving to DBFS?

Thanks!

Hubert-Dudek · ‎04-04-2022

The path in to_csv should be a directory, not a file, since one file = one partition.
Try checkpoint(eager=True).
Use df.spark.explain() before and after checkpointing.
Checkpointing saves files to disk, which costs disk space and computation, and removes the RDD from memory. So then, when you read it back from disk, it requires recomputation. I think it doesn't make sense here. I have used checkpoint only once, with a UDF that made REST API calls and needed to be executed at that exact point in the code. Databricks/Spark uses lazy evaluation and applies many optimizations to your code. Sometimes you need a checkpoint so that a step is not run the optimized way.
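
The first three suggestions can be sketched as follows. This is an illustration under assumptions, not a verified fix: the helper names `compare_plans` and `write_single_csv` are hypothetical, and the output directory passed in is whatever path you choose. If the checkpoint took effect, the second plan is expected to read from an existing RDD rather than repeat the original query.

```python
def compare_plans(psdf):
    """Print the physical plan before and after an eager checkpoint.

    Assumption: psdf is a pyspark.pandas DataFrame and a checkpoint
    directory has already been set on the SparkContext.
    """
    print("Plan before checkpoint:")
    psdf.spark.explain()
    # Keep the returned frame; the original psdf keeps its full lineage.
    checkpointed = psdf.spark.checkpoint(eager=True)
    print("Plan after checkpoint:")
    checkpointed.spark.explain()
    return checkpointed


def write_single_csv(psdf, out_dir):
    """Write psdf as one part file inside the directory out_dir."""
    # coalesce(1) leaves a single partition, hence a single part file.
    psdf.spark.coalesce(1).to_csv(out_dir)
```

Usage would look like `df = compare_plans(ps.sql(sql))` followed by `write_single_csv(df, "/FileStore/tables/test")`, where the path names a directory rather than a `.csv` file.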

alejandrofm · ‎04-04-2022

Hi, I'm repartitioning to 1 because it's easier and faster to move 1 file later instead of 10k files.

What I'm looking for is the possibility to use this or similar:

df.spark.checkpoint()

and later use df.head() or to_csv without recomputing, paying only the time it takes to merge all the calculated partitions.

I thought eager defaulted to True; I will check on that. But from what I'm seeing, it created the RDD file on disk and isn't using it: it recomputes the query.

Thanks!

alejandrofm · ‎04-21-2022

Hi, back here, any idea of what approach I should take if I want to do something like:

df.head()

--

df.info

--

df.to_csv

and have the computation run only once instead of three times?

Thanks!!!

alejandrofm · ‎05-03-2022

Sorry for the bump, I still haven't found the proper way to do this.

Thanks!

Hubert-Dudek · ‎05-03-2022

If you need checkpointing, please try the code below. Thanks to persist, you will avoid reprocessing:

df = ps.sql(sql).spark.persist()
df.spark.checkpoint()
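
For the head / info / to_csv sequence asked about earlier, a persist-based version might look like the sketch below. It assumes `psdf` is a pandas-on-Spark DataFrame and that `spark.persist()` returns a cached frame exposing `spark.unpersist()`; the helper name `inspect_and_save` is hypothetical. `len()` is used as the first action because it scans every partition, whereas `head()` alone may touch only a few.

```python
def inspect_and_save(psdf, out_dir):
    """Run several actions on psdf while computing the query only once.

    Assumption: the cache is large enough to hold the result; partitions
    that do not fit may still be recomputed or spilled, depending on the
    storage level.
    """
    cached = psdf.spark.persist()      # lazy: nothing is computed yet
    try:
        n_rows = len(cached)           # full pass, intended to fill the cache
        preview = cached.head()        # expected to be served from the cache
        cached.info()
        # Merge the cached partitions into one part file under out_dir.
        cached.spark.coalesce(1).to_csv(out_dir)
    finally:
        cached.spark.unpersist()       # release the cached data
    return n_rows, preview
```

The query itself still runs in parallel while the cache is being filled; only the final write is reduced to a single partition.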

annafina · ‎11-21-2024

checkpoint() returns a checkpointed DataFrame, so you need to assign it to a new variable:

checkpointedDF = df.checkpoint()
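
Applied to the pandas-on-Spark frame from the original question, the same idea could be sketched like this. The helper name `checkpoint_frame` is hypothetical, and the sketch assumes a checkpoint directory has already been set with `setCheckpointDir`.

```python
def checkpoint_frame(psdf, eager=True):
    """Checkpoint a pandas-on-Spark DataFrame and return the checkpointed copy.

    spark.checkpoint() does not change psdf in place. Only the returned
    frame has the truncated lineage, so the caller must keep it.
    """
    return psdf.spark.checkpoint(eager=eager)
```

In the notebook from the question this would read `df = checkpoint_frame(ps.sql(sql))`, so that the later `to_csv` call runs against the checkpointed frame instead of the original query.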

Pandas.spark.checkpoint() doesn't broke lineage
