Databricks Community

Poorva21 · ‎12-05-2025

Hi everyone,
I'm working on optimizing Databricks costs for a production-grade data pipeline (Spark + Delta Lake) on Azure. I’m looking for practical, field-tested strategies to reduce compute and storage spend without impacting performance.

So far, I’ve explored:

Auto-Optimize and Auto-Compact
Delta caching
Photon where supported
Spot instances (limited due to stability concerns)

Questions:

What are the most impactful cost optimizations you’ve applied in real-world pipelines?
Do you prefer Jobs clusters or All-purpose clusters for cost efficiency?
Any best practices for minimizing storage costs with Delta Lake (versioning, retention, vacuum, etc.)?
How do you tune cluster size smartly to avoid over-provisioning?
Any monitoring tools or dashboards you recommend for ongoing cost governance?

Any detailed recommendations, examples, or references would be super helpful.

Thanks!

K_Anudeep · ‎12-05-2025

Hello @Poorva21 ,

Below are the answers to your questions:

Q1. What are the most impactful cost optimisations for production pipelines?

I have worked with multiple customers, and based on that experience, these are the high-level optimisations I would treat as essential:

The most important optimisation is choosing the right compute resources for your pipeline. If you are unsure which resource type to choose, Databricks recommends serverless, as it auto-scales and scales to zero without requiring you to manage clusters. Doc: https://docs.databricks.com/aws/en/lakehouse-architecture/cost-optimization/
Always keep your tables well optimised with a healthy table layout. The Delta docs emphasise compaction (OPTIMIZE) and reasonable partitioning/clustering to reduce small files and speed up queries, which directly cuts DBU consumption. Doc: https://docs.databricks.com/aws/en/optimizations/
For UC-enabled tables, predictive optimisation can handle compaction and statistics automatically, which simplifies data maintenance and reduces storage costs.

Q2. Jobs clusters vs all-purpose clusters: which is more cost-efficient?

Job clusters are recommended for production pipelines. All-purpose clusters are meant for interactive or shared exploration, not production pipelines.
In my experience, for the same workload, node type, and region, running the code as a job on Jobs compute (a job cluster) costs less than running it on an all-purpose (interactive) cluster.
Additionally, job clusters give you better isolation between jobs, which in turn makes debugging easier.
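
As a rough illustration of the job-cluster approach, here is a minimal Python sketch that builds a Jobs API-style payload in which the task runs on its own job cluster rather than on a shared all-purpose cluster. The job name, notebook path, runtime version, node type, worker counts, and the spot-with-fallback setting are all placeholder assumptions, not recommendations from this thread; check the field names against the current Jobs API reference before using it.

```python
import json


def build_job_spec(job_name, notebook_path, min_workers=2, max_workers=6):
    """Return a Jobs API-style payload that runs a notebook on a job cluster.

    All concrete values (runtime, node type, worker counts) are placeholders.
    """
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "name": job_name,
        "job_clusters": [
            {
                "job_cluster_key": "etl_cluster",
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                    "node_type_id": "Standard_D4ds_v5",   # placeholder Azure VM size
                    # Autoscaling bounds limit how far the cluster can grow.
                    "autoscale": {
                        "min_workers": min_workers,
                        "max_workers": max_workers,
                    },
                    # Assumed setup: first node (the driver) on-demand,
                    # remaining nodes on spot with fallback to on-demand.
                    "azure_attributes": {
                        "first_on_demand": 1,
                        "availability": "SPOT_WITH_FALLBACK_AZURE",
                    },
                },
            }
        ],
        "tasks": [
            {
                "task_key": "main",
                "job_cluster_key": "etl_cluster",
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


# Hypothetical job name and notebook path, for illustration only.
print(json.dumps(build_job_spec("nightly_etl", "/Workspace/etl/nightly"), indent=2))
```

The cluster is created when the job starts and terminated when it finishes, so nothing sits idle between runs.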

Q3. How do I minimise storage costs with Delta Lake (versioning, retention, VACUUM, etc.)?

In Delta, you control how much data a table retains with three main knobs: VACUUM, delta.deletedFileRetentionDuration, and delta.logRetentionDuration. If you know your dataset well, you can tune these so the table keeps only the data you actually need, which keeps storage costs in check.
It's important to run maintenance tasks (OPTIMIZE and VACUUM) regularly but not aggressively (for example, once a week), both to optimise the table and to remove stale, unreferenced files. As a best practice, again, the recommended approach is to enable predictive optimisation (PO) so that maintenance is managed for you.
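
To make the three knobs concrete, here is a small Python sketch that generates the SQL statements for a weekly maintenance pass. The table name and the retention values are illustrative assumptions (7 days and 30 days match the Delta defaults as far as I know), and the guard against going below 7 days is a deliberately conservative choice, since shorter retention can break time travel and long-running readers.

```python
def maintenance_statements(table, deleted_file_days=7, log_days=30):
    """Build SQL for setting retention properties, compacting, and vacuuming a table."""
    if deleted_file_days < 7:
        # Conservative guard (an assumption of this sketch): stay at or above 7 days.
        raise ValueError("deleted_file_days below 7 is not allowed in this sketch")
    set_props = (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.deletedFileRetentionDuration' = 'interval {deleted_file_days} days', "
        f"'delta.logRetentionDuration' = 'interval {log_days} days')"
    )
    return [
        set_props,
        f"OPTIMIZE {table}",  # compact small files
        f"VACUUM {table} RETAIN {deleted_file_days * 24} HOURS",  # drop unreferenced files
    ]


# Hypothetical table name, for illustration only.
for stmt in maintenance_statements("main.sales.orders"):
    print(stmt)
```

In a notebook or scheduled job, each statement could be passed to spark.sql(...). With predictive optimisation enabled, a manual schedule like this should be unnecessary for the tables it covers.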

Q4. How do I tune cluster size smartly to avoid over-provisioning?

I believe you can simply follow the documentation on compute sizing considerations: https://docs.databricks.com/aws/en/compute/cluster-config-best-practices?#compute-sizing-considerati... It covers what you need for this question.

Q5. What monitoring tools or dashboards should I use for ongoing cost governance?

Databricks recommends using system tables together with the prebuilt dashboards for this. Doc: https://docs.databricks.com/aws/en/admin/system-tables/
There are also usage dashboards for this. Doc: https://docs.databricks.com/aws/en/admin/account-settings/usage
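
As a starting point for a custom cost view on top of system tables, here is a minimal Python sketch that builds a query summing daily usage per SKU. The table and column names (system.billing.usage, usage_date, sku_name, usage_quantity) reflect my understanding of the billing system table schema and should be verified against the system tables documentation; the 30-day default window is an arbitrary assumption.

```python
def daily_usage_query(days=30):
    """Build a SQL query that sums usage per day and SKU over the last `days` days."""
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        "SELECT usage_date, sku_name, SUM(usage_quantity) AS total_usage\n"
        "FROM system.billing.usage\n"
        f"WHERE usage_date >= date_sub(current_date(), {days})\n"
        "GROUP BY usage_date, sku_name\n"
        "ORDER BY usage_date, total_usage DESC"
    )


print(daily_usage_query(30))
```

Run from a notebook with spark.sql(daily_usage_query(30)), the result can back a dashboard tile or an alert when a SKU's daily usage jumps unexpectedly.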

Please let me know if you have any further questions. Additionally, if you find this answer helpful, please accept it as a solution.

Anudeep

Best Practices for Optimizing Databricks Costs in Production Workloads?
