
When to use cache vs checkpoint?

User16752240150
New Contributor II

I've seen .cache() and .checkpoint() used similarly in some workflows I've come across. What's the difference, and when should I use one over the other?

1 REPLY

Srikanth_Gupta_
Valued Contributor

Caching is generally more useful than checkpointing when you have enough memory available to hold your RDDs or DataFrames, even when they are large.

Caching keeps the result of your transformations so that they do not have to be recomputed when further transformations are applied to the RDD or DataFrame. When you cache, Spark keeps the history of transformations (the lineage) and recomputes from it if the cached data is lost, for example when partitions are evicted because memory runs short. When you checkpoint, Spark discards the lineage and writes the final RDD or DataFrame to reliable storage such as HDFS, where the files stay until you delete them.

The main cost of checkpointing is that writing to HDFS is slower than caching. You also need to set up a checkpoint location on HDFS first. persist(StorageLevel.DISK_ONLY) does something similar, but it keeps the history of your transformations.

Checkpointing is mainly used in stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, so the dependency chain keeps growing over time. To avoid such unbounded increases in recovery time, the intermediate RDDs are checkpointed periodically, which cuts the dependency chain.

Checkpointing is also used in streaming applications to store metadata so that the application can recover from failures.
