Databricks Community

User16752240150 · 06-04-2021

I've seen .cache() and .checkpoint() used similarly in some workflows I've come across. What's the difference, and when should I use one over the other?

Srikanth_Gupta_ · 06-25-2021

Caching is usually more useful than checkpointing when you have enough available memory to hold your RDDs or DataFrames, even if they are massive.

Caching keeps the result of your transformations so that they do not have to be recomputed when further transformations are applied to the RDD or DataFrame. When you cache, Spark also keeps the history of the transformations applied (the lineage) and recomputes them if there is not enough memory to hold the cached data. When you checkpoint, Spark throws away that history and writes the final DataFrame to HDFS, where it remains after the application ends unless you clean it up.

The main drawback of checkpointing is that it writes the data to HDFS, which is slower than caching. You also need to set up a checkpoint location on HDFS. persist(StorageLevel.DISK_ONLY) does a similar thing, but it keeps the history of your transformations.

Checkpointing is mainly used in stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time, the intermediate RDDs are periodically checkpointed, which cuts off the dependency chain.

Checkpointing is also used in streaming applications to store metadata so that the application can recover from failures.


When to use cache vs checkpoint?
