Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best Practices for Optimizing Databricks Costs in Production Workloads?

Poorva21
New Contributor

Hi everyone,
I'm working on optimizing Databricks costs for a production-grade data pipeline (Spark + Delta Lake) on Azure. I'm looking for practical, field-tested strategies to reduce compute and storage spend without impacting performance.

So far, I've explored:

  • Auto-Optimize and Auto-Compact

  • Delta caching

  • Photon where supported

  • Spot instances (limited due to stability concerns)
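For reference, this is roughly how those settings are applied today (a minimal sketch from a Databricks notebook, where `spark` is predefined; the table name is a placeholder):

```python
# Enable the disk (Delta) cache on the cluster for repeated reads of the same files.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Turn on optimized writes and auto-compaction for an existing Delta table
# so small files are compacted as data lands.
spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```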

Questions:

  1. What are the most impactful cost optimizations you've applied in real-world pipelines?

  2. Do you prefer Jobs clusters or All-purpose clusters for cost efficiency?

  3. Any best practices for minimizing storage costs with Delta Lake (versioning, retention, vacuum, etc.)?

  4. How do you tune cluster size smartly to avoid over-provisioning?

  5. Any monitoring tools or dashboards you recommend for ongoing cost governance?

Any detailed recommendations, examples, or references would be super helpful.

Thanks!

1 REPLY

K_Anudeep
Databricks Employee

Hello @Poorva21 ,

Below are the answers to your questions:

Q1. What are the most impactful cost optimisations for production pipelines?

I have worked with multiple customers, and based on that experience, these are the high-level optimisations you should have in place:

  • The most impactful optimisation is choosing the right compute for your pipeline. If you are unsure which compute type to choose, Databricks recommends serverless, as it auto-scales and scales to zero without requiring you to manage clusters. Doc: https://docs.databricks.com/aws/en/lakehouse-architecture/cost-optimization/
  • Always keep your tables well optimised and maintain a healthy table layout. The Delta docs emphasise compaction (OPTIMIZE) and reasonable partitioning/clustering to reduce small files and speed up queries, which directly cuts DBU consumption. Doc link: https://docs.databricks.com/aws/en/optimizations/
  • For UC-managed tables, enable predictive optimisation so compaction and statistics collection are handled automatically, which simplifies data maintenance and reduces storage costs (see the sketch after this list).
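For concreteness, here is a minimal maintenance sketch (run from a Databricks notebook or job where `spark` is predefined; table and column names are placeholders):

```python
# Compact small files and co-locate rows on the most common filter column.
spark.sql("OPTIMIZE sales.orders ZORDER BY (order_date)")

# On Unity Catalog managed tables, predictive optimisation can take over
# this kind of maintenance (OPTIMIZE, VACUUM, statistics) automatically.
spark.sql("ALTER TABLE sales.orders ENABLE PREDICTIVE OPTIMIZATION")
```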

Q2. Jobs clusters vs all-purpose clusters: which is more cost-efficient?

  • Use job clusters for production pipelines. All-purpose clusters are meant for interactive/shared exploration, not production workloads (a sample job definition is sketched after this list).
  • In my experience, with the same workload, node type, and region, running the code as a job on Jobs compute (a job cluster) costs less than running it on an all-purpose (interactive) cluster, because Jobs compute is billed at a lower DBU rate.
  • Additionally, job clusters give each job its own isolated compute, which in turn improves debuggability.
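For illustration, below is a rough sketch of a Jobs API job definition that runs on an ephemeral job cluster rather than an existing all-purpose cluster; the node type, runtime version, and notebook path are placeholders to replace with your own:

```python
# Submitted via the Jobs UI, REST API, or SDK; shown here as the raw job spec.
job_spec = {
    "name": "nightly-orders-pipeline",
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Pipelines/transform_orders"},
            # Job cluster: created for the run, terminated afterwards,
            # and billed at the Jobs compute DBU rate.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_D4ds_v5",
                "autoscale": {"min_workers": 2, "max_workers": 6},
                "runtime_engine": "PHOTON",
            },
        }
    ],
}
```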

Q3. How do I minimise storage costs with Delta Lake (versioning, retention, VACUUM, etc.)?

  • In Delta, you control how much data is retained in a table with three main knobs: VACUUM, delta.deletedFileRetentionDuration, and delta.logRetentionDuration. If you know your dataset and recovery requirements, tune these three so the table keeps only the history you actually need, which keeps storage costs in check (a sketch follows this list).
  • It is also important to run maintenance tasks (OPTIMIZE and VACUUM) regularly, not aggressively; once a week is typically enough to compact the table and remove stale, unreferenced files. As a best practice, the simplest approach is to enable predictive optimisation (PO) and let it manage this maintenance for you.
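A minimal sketch of those knobs, assuming a weekly maintenance job; the table name and retention windows are placeholders you should tune to your time-travel and recovery requirements:

```python
# Keep 7 days of deleted data files (for time travel / recovery)
# and 30 days of transaction log history.
spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 7 days',
        'delta.logRetentionDuration'         = 'interval 30 days'
    )
""")

# Weekly maintenance: compact small files, then drop files no longer
# referenced by the table and older than the retention window.
spark.sql("OPTIMIZE sales.orders")
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")  # 168 hours = 7 days
```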

Q4. How do I tune cluster size smartly to avoid over-provisioning?

 

Q5. What monitoring tools or dashboards should I use for ongoing cost governance?

 

Please let me know if you have any further questions. Additionally, if you find this answer helpful, please accept it as a solution.

Anudeep