🚀 Spark Caching vs Databricks Disk Caching

Coffee77 — Thu, 28 Aug 2025 15:42:49 GMT

As promised, @BS_THE_ANALYST: in this new video, summarized in this post, I explain what Spark Caching and Databricks Disk Caching are, and how a caching strategy can make these two cool features work together:

Spark Caching vs Databricks Disk Caching

Spark Caching (Memory/Disk via cache() or persist())

Scope: Spark application / job level

How it works: .cache() and .persist() are lazy; they only mark the DataFrame/RDD for caching. Spark materializes the dataset when the first action runs and keeps its partitions on the executors. The storage level decides where they go: DataFrame.cache() defaults to MEMORY_AND_DISK, so partitions that do not fit in memory spill to local disk, while RDD.cache() defaults to MEMORY_ONLY, where partitions that do not fit are recomputed when needed. .persist() lets you pick the storage level explicitly (MEMORY_ONLY, MEMORY_AND_DISK, etc.).
Where it lives: In Spark executor memory (the JVM heap by default), and optionally on the executors' local disk.
Persistence: Data disappears when the Spark application ends, when you call .unpersist(), or when it is evicted due to memory pressure.
Best for: Reusing intermediate results across multiple actions in the same job, and iterative algorithms (ML, graph processing, etc.).

Databricks Disk Caching (formerly known as Delta Cache)

Scope: Cluster level

How it works: This is a transparent I/O-level cache built into Databricks Runtime that copies data read from cloud object storage (S3, ADLS, GCS) onto the local NVMe SSDs of the cluster nodes. It works at the file-block level and is not tied to any single Spark job.
Databricks disk caching can only be enabled on clusters that have local SSD storage ⚠️
Where it lives: Outside of the JVM, on the local SSDs of the Databricks cluster, managed automatically by Databricks Runtime.
Persistence: Survives across Spark jobs running on the same cluster. It is cleared when the cluster is terminated, and cached data is dropped when local SSD space is needed for something else.
Best for: Repeated reads of the same files from cloud storage across different jobs or notebooks, improving read performance from Delta tables and Parquet files
Trigger: No code change. Automatic on DBR >= 10.4; controlled by the spark.databricks.io.cache.enabled setting (true to enable).

Why Together = Best Performance

Disk caching = reduces cloud I/O latency (cluster-wide).

Spark caching = reduces recomputation overhead (job-specific).

Using both ensures:

Faster initial reads thanks to SSD cache.
Faster subsequent transformations and iterative operations thanks to Spark's in-memory and/or on-disk cache.
