As promised @BS_THE_ANALYST , in this new video (and summarized in this post) I explain what Spark Caching and Databricks Disk Caching are, and how a caching strategy can leverage both of these features together:
Spark Caching vs Databricks Disk Caching

Spark Caching (Memory/Disk via cache() or persist())
Scope: Spark application / job level
How it works: When you call .cache() or .persist() on a DataFrame/RDD, Spark materializes that dataset after the first action and keeps it in executor memory (RAM). If memory is insufficient and .persist() was used, it can spill to disk depending on the storage level (MEMORY_ONLY, MEMORY_AND_DISK, etc.).
Where it lives: Inside the Spark executor JVM heap, and optionally on local disk.
Persistence: Data disappears when the Spark application ends, or if it is evicted due to memory pressure.
Best for: Reusing intermediate results across multiple actions in the same job, and iterative algorithms (ML, graph processing, etc.)
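
A minimal PySpark sketch of both options (the table and column names here are illustrative):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Illustrative table and columns; replace with your own
df = spark.read.table("sales_transactions")

# Keep the intermediate result in executor memory after the first action
filtered = df.filter(df.amount > 100).cache()

# Or allow spilling to local disk when executor memory runs out
# filtered = df.filter(df.amount > 100).persist(StorageLevel.MEMORY_AND_DISK)

filtered.count()                                   # first action materializes the cache
filtered.groupBy("region").sum("amount").show()    # reuses the cached data, no recompute

filtered.unpersist()                               # free executor memory when done
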
Databricks Disk Caching (previously known as Delta Cache)
Scope: Cluster level
How it works: This is a transparent I/O-level cache built into the Databricks Runtime that stores data from cloud object storage (S3, ADLS, GCS) on the local NVMe SSDs of the cluster nodes. It works at the file block level and is not tied to a Spark job.
⚠️ Databricks disk caching can only be enabled on clusters that have local SSD storage.
Where it lives: Outside of the JVM, on local SSDs of the Databricks cluster and managed automatically by Databricks Runtime.
Persistence: Survives across Spark jobs running on the same cluster, cleared when the cluster is terminated or when local SSD storage is needed for something else.
Best for: Repeated reads of the same files from cloud storage across different jobs or notebooks, improving read performance from Delta tables and Parquet files
Trigger: No code change needed; automatic on DBR >= 10.4, or enabled explicitly via spark.databricks.io.cache.enabled = true
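
A minimal sketch of enabling it explicitly (the table name is illustrative):

# No effect on clusters without local SSD storage
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Reads from cloud storage are now cached transparently on local NVMe SSDs;
# the query itself does not change
df = spark.read.table("sales_transactions")
df.groupBy("region").count().show()
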
Why Together = Best Performance
Disk caching = reduces cloud I/O latency (cluster-wide).
Spark caching = reduces recomputation overhead (job-specific).

Using both ensures that data is read quickly from cloud storage (disk cache) and that expensive transformations are not recomputed (Spark cache).
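
A sketch of how the two can work together (illustrative names again): the disk cache speeds up the scan from cloud storage cluster-wide, while the Spark cache keeps the aggregated result in executor memory for reuse within the job.

spark.conf.set("spark.databricks.io.cache.enabled", "true")        # cluster-wide SSD cache for cloud reads

raw = spark.read.table("sales_transactions")                       # repeat scans served from local SSDs
daily = raw.groupBy("sale_date", "region").sum("amount").cache()   # job-level cache in executor memory

daily.count()                                                      # first action materializes both caches
daily.filter("region = 'EMEA'").show()                             # no recompute, no cloud round trip
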
https://www.youtube.com/@CafeConData