cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

When to use cache vs checkpoint?

User16752240150
New Contributor II

I've seen .cache() and .checkpoint() used similarly in some workflows I've come across. What's the difference, and when should I use one over the other?

1 REPLY 1

Srikanth_Gupta_
Databricks Employee
Databricks Employee

Caching is extremely useful than checkpointing when you have lot of available memory to store your RDD or Dataframes if they are massive.

Caching will maintain the result of your transformations so that those transformations will not have to be recomputed again when additional transformations is applied on RDD or Dataframe, when you apply Caching Spark stores history of transformations applied and re compute them in case of insufficient memory, but when you apply checkpointing spark throws away all of your transformations and stores finally Dataframe into HDFS forever. the main problem of checkpointing is to store the data into HDFS which is slower than caching. you also need to setup checkpointing location on HDFS. persist(StorageLevel.DISK_ONLY) also has does similar thing but it stores history of your transformations. Checkpointing is mainly used in stateful transformation that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time 

checkpointing is also used in streaming application to store meta data to recover from failures.

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now