What are the advantages of Databricks caching versus Spark caching?

Ryan_Chynoweth
Honored Contributor III
 
2 REPLIES

User16869510359
Esteemed Contributor

Delta caching is a feature exclusive to Databricks, which means it is not available in open-source Spark. Spark caching, by contrast, is available both in Databricks and in OSS Spark.

At a high level, Delta caching stores data on the executors' local disks for repeated access, while Spark caching stores data in memory, on disk, or both. The two mechanisms also differ in how eviction and refresh are handled.

A comparison is provided here:

https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-apache-spark-caching
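
As a quick illustration of the difference, here is a minimal PySpark sketch; the path "/mnt/data/events" is a placeholder, and the spark.databricks.io.cache.enabled setting assumes a cluster whose instance type has local SSDs that support the Delta cache:

# Minimal sketch; "/mnt/data/events" is a placeholder path.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Spark caching: explicitly persist a DataFrame in executor memory and/or disk.
df = spark.read.format("delta").load("/mnt/data/events")
df.persist(StorageLevel.MEMORY_AND_DISK)   # or simply df.cache()
df.count()                                 # first action materializes the cache

# Delta caching: enable the disk cache; remote Parquet/Delta files are then
# copied to the executors' local disks automatically on first access.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
events = spark.read.format("delta").load("/mnt/data/events")
events.count()   # first read populates the cache; later reads hit local disk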

User16783853906
Contributor II

Delta cache accelerates data reads by creating copies of remote files in nodes' local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved read speed.

Here are the characteristics of each type:

  • Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store the results of arbitrary subqueries. The Spark cache can store the results of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).
  • Performance: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
  • Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and requires no action. However, to preload data into the cache beforehand, you can use the CACHE command (see the sketch after this list). When you use the Spark cache, you must manually specify the tables and queries to cache.
  • Disk vs memory-based: The Delta cache is stored entirely on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.
  • Data refresh: The Delta cache automatically detects when data files are created or deleted and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data.
  • Instance support: Delta caching is not configured by default for all instance families. You have to validate that the instance family of your cluster supports Delta caching before relying on it.
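
To illustrate the automatic vs manual control point above, here is a short sketch; the "sales" table is hypothetical, and CACHE SELECT assumes a Databricks cluster whose instance family supports the Delta cache:

# Sketch only; "sales" is a hypothetical table.

# Delta cache: populated automatically on first read, but can be warmed up front.
spark.sql("CACHE SELECT * FROM sales")

# Spark cache: always explicit; you decide what to cache and manage its lifecycle.
spark.sql(
    "CACHE TABLE sales_by_region AS "
    "SELECT region, sum(amount) AS total FROM sales GROUP BY region"
)
spark.sql("UNCACHE TABLE sales_by_region")   # manual invalidation when done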
