
🚀 Spark Caching vs Databricks Disk Caching

Coffee77
New Contributor II

As promised, @BS_THE_ANALYST: in this new video, summarized in this post, I explain what Spark Caching and Databricks Disk Caching are, and how a caching strategy can leverage both of these cool features together:

Spark Caching vs Databricks Disk Caching 


Spark Caching (Memory/Disk via cache() or persist())

Scope: Spark application / job level

  • How it works: When you call .cache() or .persist() on a DataFrame/RDD, Spark materializes that dataset after the first action and keeps it in executor memory (RAM). If memory is insufficient and .persist() is used, data can spill to disk, depending on the storage level (MEMORY_ONLY, MEMORY_AND_DISK, etc.); see the sketch after this list.

  • Where it lives: Inside the Spark executor JVM heap, and optionally on local disk.

  • Persistence: Data disappears when the Spark application ends, or if it is evicted due to memory pressure.

  • Best for: Reusing intermediate results across multiple actions in the same job; iterative algorithms (ML, graph processing, etc.)
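
A minimal PySpark sketch of both calls. The table name and filter columns are hypothetical placeholders; on Databricks, the spark session is already provided in notebooks.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("spark-caching-demo").getOrCreate()

# Hypothetical source table; any DataFrame you reuse works the same way.
trips = spark.read.table("samples.nyctaxi.trips").filter("trip_distance > 1.0")

# MEMORY_AND_DISK: keep partitions in executor RAM, spilling to local disk
# under memory pressure. For DataFrames, .cache() is shorthand for the
# default memory-and-disk storage level.
trips.persist(StorageLevel.MEMORY_AND_DISK)

trips.count()                               # first action: computes and populates the cache
trips.groupBy("pickup_zip").count().show()  # reuses cached partitions, no recompute

trips.unpersist()                           # free executor memory when done
```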

Databricks Disk Caching (formerly known as Delta Cache)

Scope: Cluster level

  • How it works: This is a transparent I/O-level cache built into the Databricks Runtime that stores data from cloud object storage (S3, ADLS, GCS) on the local NVMe SSDs of the cluster nodes. It's at the file block level, not tied to a Spark job.

  • ⚠️ Databricks disk caching can only be enabled on clusters that have local SSD storage.

  • Where it lives: Outside of the JVM, on local SSDs of the Databricks cluster and managed automatically by Databricks Runtime.

  • Persistence: Survives across Spark jobs running on the same cluster, cleared when the cluster is terminated or when local SSD storage is needed for something else.

  • Best for: Repeated reads of the same files from cloud storage across different jobs or notebooks, improving read performance from Delta tables and Parquet files.

  • Trigger: No code change needed; automatic on DBR >= 10.4, and can be toggled explicitly via spark.databricks.io.cache.enabled, as shown below.
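
A minimal sketch of the toggle, assuming a Databricks notebook where spark is predefined (set the flag to "false" to opt out):

```python
# Enable (or verify) the Databricks disk cache for this cluster session.
# No other code changes are needed: reads from cloud object storage are
# cached on local SSDs transparently.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Check the current setting:
print(spark.conf.get("spark.databricks.io.cache.enabled"))
```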

Why Together = Best Performance

Disk caching = reduces cloud I/O latency (cluster-wide).

Spark caching = reduces recomputation overhead (job-specific).


Using both ensures:

  • Faster initial reads thanks to SSD cache.

  • Faster subsequent transformations and iterative operations thanks to Spark in-memory caching (see the combined sketch below).
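
Putting it together, a hedged end-to-end sketch: the table and column names are hypothetical, and spark is the notebook session.

```python
# First scan pulls Delta/Parquet files from cloud storage; the disk cache
# transparently keeps copies on the workers' local SSDs, so any later scan
# of the same files (this notebook or another job on the cluster) skips cloud I/O.
events = spark.read.table("main.demo.events")

# Spark caching then keeps the *derived* result in executor memory, so
# repeated actions skip recomputing the filter and aggregation.
daily = (
    events.filter("event_date >= '2024-01-01'")
          .groupBy("event_date")
          .count()
          .cache()
)

daily.count()    # materializes the Spark cache
daily.show(10)   # served from memory; underlying file reads hit the SSD cache
```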

 

 

https://www.youtube.com/@CafeConData
