Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the local SSD used for in Databricks?

646901
New Contributor II

What is the use case for local SSDs in Databricks clusters? I noticed some clusters have many TBs' worth and some have no local SSDs.

What are the pros and cons of making the disk size bigger or smaller?

 

According to the docs:

> The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.

But what is the other half used for? Is it swap? Temp files written during sorting and shuffles?

1 REPLY

Kaniz_Fatma
Community Manager

Hi @646901, local SSDs in Databricks clusters serve several purposes and can affect performance, cost, and scalability.

Let's delve into the details:

 

Use Cases for Local SSDs:

  • Low-Latency Storage: Local SSDs provide fast, low-latency storage that is ideal for workloads requiring quick access to data. These workloads include caching, intermediate data storage, and temporary files.
  • Disk Cache: As you mentioned, Databricks uses local SSDs for disk caching. The disk cache keeps copies of frequently read data to speed up subsequent reads, which improves query performance by reducing fetches from slower storage (such as DBFS or external cloud storage). A quick way to check and enable it is shown in the snippet after this list.
  • Temporary Files: During execution, Spark creates temporary files (e.g., during shuffling, sorting, or intermediate computations). Local SSDs are used to store these temporary files, which helps avoid bottlenecks caused by slower network storage.
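To make the disk-cache point concrete, here is a minimal sketch for a Databricks notebook (where `spark` is the pre-created SparkSession). The table name is a placeholder, and the sizing options shown in comments are normally set in the cluster's Spark config at creation time rather than at runtime:

```python
# Minimal sketch: check, enable, and pre-warm the Databricks disk cache from a notebook.
# Assumes the pre-created `spark` session; the table name below is a placeholder.

# Is the disk cache on for this cluster? (It is enabled by default on SSD-backed worker types.)
print(spark.conf.get("spark.databricks.io.cache.enabled", "not set"))

# Turn it on (or off) at runtime.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# How much local SSD the cache may use is a cluster-level Spark config set at creation time, e.g.:
#   spark.databricks.io.cache.maxDiskUsage 50g
#   spark.databricks.io.cache.maxMetaDataCache 1g

# Optionally pre-warm the cache so later queries read from local SSD instead of cloud storage.
spark.sql("CACHE SELECT * FROM my_catalog.my_schema.my_table")
```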

Pros and Cons of Changing Disk Size:

  • Increasing Disk Size:
    • Pros:
      • More Caching: Larger local SSDs allow for more caching, which can improve query performance.
      • Reduced Spill Pressure: Spark spills shuffle and sort data to local disk, so more local space means large jobs are less likely to run out of room mid-query.
    • Cons:
      • Cost: more local SSD usually means a larger instance type (more DBUs and higher cloud cost) or additional attached volumes billed by the cloud provider.
      • Instance Coupling: local SSD capacity is tied to the instance type on most clouds, so sizing up the disk often means sizing up the whole node or attaching extra volumes (a cluster-spec sketch follows this list).
  • Decreasing Disk Size:
    • Pros:
      • Cost Savings: Smaller SSDs are more cost-effective.
      • Resource Efficiency: smaller or fewer attached volumes keep the cluster spec lean when the workload doesn't need the extra space.
    • Cons:
      • Reduced Caching: Smaller SSDs limit the amount of data that can be cached.
      • Increased Spill Risk: large shuffles and sorts can exhaust the limited local space, slowing jobs or failing them with disk-space errors.
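To make the "bigger disk" option concrete, here is a hedged sketch of attaching extra general-purpose SSD volumes per node when creating a cluster through the Clusters API on AWS. The workspace URL, token, runtime version, node type, and sizes are placeholders, not recommendations; Azure and GCP expose different knobs for the same idea:

```python
# Hedged sketch: add per-node EBS volumes when creating a cluster via the Clusters API (AWS).
# Host, token, runtime, node type, and sizes are placeholders for illustration only.
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token

cluster_spec = {
    "cluster_name": "extra-local-storage-demo",
    "spark_version": "14.3.x-scala2.12",   # example runtime; pick one available in your workspace
    "node_type_id": "i3.xlarge",            # i3 instances also ship with local NVMe SSDs
    "num_workers": 2,
    "aws_attributes": {
        # Optional attached SSD volumes per node, on top of any instance-local SSDs.
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,             # GB per volume
    },
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"cluster_id": "..."}
```

Either way, the trade-off is the same: pay only for storage the workload will actually use for caching and shuffle.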

The Other Half of Local SSDs:

  • The other half of local SSDs not used for caching is typically reserved for temporary files generated during Spark job execution, such as intermediate results, shuffle data, and other transient data (the snippet after this list shows a quick way to inspect this space).
  • It's not used for swap: swap, where enabled at all, is managed by the operating system rather than by Spark or the disk cache.
  • By keeping this space available, Databricks ensures that Spark jobs have sufficient room for temporary storage without causing resource contention.
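If you want to look at that space yourself, here is a quick check from a notebook (assuming the usual /local_disk0 mount on Databricks nodes, which may differ by cloud or node type):

```python
# Inspect where Spark writes shuffle/spill files and how much local space remains.
# Assumes a Databricks notebook (`spark` predefined) and the common /local_disk0 mount.
import subprocess

# Directory Spark uses for shuffle and spill files, if the conf is visible to this session.
print(spark.conf.get("spark.local.dir", "spark.local.dir not exposed to this session"))

# Free space on the driver's local SSD mount (workers are provisioned similarly).
print(subprocess.run(["df", "-h", "/local_disk0"], capture_output=True, text=True).stdout)
```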

In summary, local SSDs play a crucial role in improving performance by caching data and storing temporary files. The trade-off lies in balancing cost, performance, and resource allocation based on your workload requirements.
