
Azure VM quota for Databricks jobs - demand prediction

noorbasha534
Valued Contributor II

Hey folks,

A quick check -

I wanted to gather your thoughts on how you manage demand for Azure VM quota so you don't run into quota limit issues.

In our case, we have several data domains (finance, master data, supply chain...) executing their projects in Databricks. Most of them use, say, the EDSv4 machine family for data processing via Databricks jobs (non-serverless). We have requested, say, 10,000 vCPUs of quota from Azure. At times we run out of quota even though we have an alert at 75% usage; some days usage hits the 100% limit.

We prepared dashboards using system tables to see how many vCPUs each data domain is consuming and which one actually caused the quota limit to be hit; but this is all reactive. We would also like to put restrictions on VM types using compute policies.

Asking data domains to provide information about their demand (number of jobs going live as part of a new project in a month/quarter, etc.) is not going to work in our environment, don't ask me why 😂.

Curious to see how others handle these cases...

3 REPLIES

mark_ott
Databricks Employee

To proactively manage Azure VM quota and avoid unexpected quota limit issues, particularly when running multiple Databricks projects across different data domains, several strategies can be adopted. Your current approach (dashboards and alerts on high utilization) is a strong foundation, but moving from purely reactive to proactive quota governance involves a few more layers.

Proactive Quota Management Strategies

1. Implement Hard Quotas and Reservations

  • Databricks compute policies: Set policies to limit which VM types each data domain can use and how many resources it can request. Policies help prevent one domain from consuming all available vCPUs and impacting others (see the sketch below).

  • Resource groups or subscriptions per domain: Allocate quotas at the resource group or subscription level—this isolates domains and enforces caps programmatically.
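
As a hedged sketch (the policy name, node types, and limits below are illustrative placeholders, not recommendations), a per-domain policy created with the Databricks Python SDK could look like this:

```python
# Hedged sketch: a per-domain Databricks cluster policy that pins jobs to the
# Edsv4 family and caps autoscaling. Names and limits are illustrative.
import json

from databricks.sdk import WorkspaceClient  # pip install databricks-sdk

w = WorkspaceClient()  # picks up host/token from env vars or ~/.databrickscfg

policy_definition = {
    # Allow only Edsv4-family worker nodes
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_E8ds_v4", "Standard_E16ds_v4"],
    },
    # Cap autoscaling so a single job cannot absorb the whole quota
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
}

w.cluster_policies.create(
    name="finance-domain-jobs",  # hypothetical per-domain policy name
    definition=json.dumps(policy_definition),
)
```

If each domain can only create compute through its own policy, the product of max workers and the largest allowed node size gives you a rough per-domain vCPU envelope.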

2. Automated Quota Forecasting and Detection

  • Trend analysis: Use your dashboard/system tables to analyze historical peak usage by domain.

    • Build simple moving average models to forecast when you might hit certain thresholds based on growth trends per domain.

    • Trigger automated tickets or alerts if a trend points towards a breach, allowing you to open a quota increase request earlier.
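
A minimal forecasting sketch, assuming daily peak vCPU usage has already been exported from system tables into a DataFrame (the synthetic data and the 10,000-vCPU quota below are placeholders):

```python
# Illustrative sketch: fit a linear trend to smoothed daily peak vCPU usage
# and estimate days until the quota is breached.
import numpy as np
import pandas as pd

QUOTA = 10_000  # subscription vCPU quota for the VM family (assumed)

# One row per day; in practice this would come from your system-tables query
daily_peaks = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=90, freq="D"),
    "vcpus": np.linspace(6_000, 8_200, 90) + np.random.normal(0, 150, 90),
})

# Smooth daily noise with a 7-day moving average
daily_peaks["smoothed"] = daily_peaks["vcpus"].rolling(7).mean()

# Fit a linear trend and extrapolate to the quota line
trend = daily_peaks.dropna()
slope, intercept = np.polyfit(np.arange(len(trend)), trend["smoothed"], 1)
if slope > 0:
    days_to_breach = (QUOTA - trend["smoothed"].iloc[-1]) / slope
    print(f"~{days_to_breach:.0f} days of headroom at the current growth rate")
    # e.g. open a quota-increase ticket automatically when this drops below 30
```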

3. Job Submission Controls

  • Concurrency throttling: Implement controls (via the Databricks Jobs API, or a wrapper/pipeline orchestrator like Azure Data Factory) to queue or throttle jobs if the aggregate vCPU request would exceed an allocated cap (see the sketch below).

  • Quotas at workspace/job cluster level: Apply Databricks cluster policies to restrict max nodes/vCPUs per job or workspace.
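
One way to sketch such a gate with the Databricks Python SDK (the cap, job ID, and per-job footprint are assumptions; autoscaling clusters are approximated by their current worker count):

```python
# Rough sketch of a pre-submission gate: estimate vCPUs held by running
# clusters and trigger the job only if headroom remains.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()
VCPU_CAP = 9_000  # soft cap kept below the real Azure quota (assumed)

# Core counts per node type, from the workspace's node-type catalog
cores = {nt.node_type_id: nt.num_cores
         for nt in w.clusters.list_node_types().node_types}

# Approximate vCPUs in use: workers plus one driver node per running cluster
in_use = sum(
    ((c.num_workers or 0) + 1) * cores.get(c.node_type_id, 0)
    for c in w.clusters.list()
    if c.state == State.RUNNING
)

job_vcpus = 128  # expected footprint of the job about to start (assumed)
if in_use + job_vcpus <= VCPU_CAP:
    w.jobs.run_now(job_id=123)  # hypothetical job ID
else:
    print(f"Deferred: {in_use:.0f} vCPUs in use, cap is {VCPU_CAP}")
```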

4. VM SKU Allow/Deny Lists

  • Use Azure Policy or Databricks compute policies to restrict which VM types can be used by which users/groups. This reduces contention for premium SKUs and keeps demand on your existing quota predictable.
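
If you manage the Azure side programmatically, a hedged sketch with the azure-mgmt-resource SDK might assign the built-in "Allowed virtual machine size SKUs" policy to the Databricks managed resource group. Subscription ID, group name, SKU list, and the built-in definition GUID should all be verified against your own tenant:

```python
# Hedged sketch: assign the built-in "Allowed virtual machine size SKUs"
# Azure Policy to the resource group hosting the Databricks-managed VMs.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import PolicyClient

client = PolicyClient(DefaultAzureCredential(), "<subscription-id>")

scope = "/subscriptions/<subscription-id>/resourceGroups/databricks-managed-rg"
builtin = ("/providers/Microsoft.Authorization/policyDefinitions/"
           "cccc23c7-8427-4f53-ad12-b6a63eb452b3")  # verify this GUID

client.policy_assignments.create(
    scope=scope,
    policy_assignment_name="restrict-vm-skus",
    parameters={
        "policy_definition_id": builtin,
        "parameters": {
            "listOfAllowedSKUs": {
                "value": ["Standard_E8ds_v4", "Standard_E16ds_v4"],
            },
        },
    },
)
```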

5. Quota Increase Requests—Automate & Preempt

  • If your organization is growing, schedule periodic reviews to request quota increases preemptively, based on trend data from your dashboards.
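
A small sketch that makes those reviews data-driven by reading current per-family usage straight from Azure's compute usage API (the region and the 75% threshold are assumptions):

```python
# Small sketch: list vCPU families approaching their quota so increase
# requests can be opened preemptively.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

for usage in client.usage.list(location="westeurope"):
    if usage.limit and usage.current_value / usage.limit >= 0.75:
        print(f"{usage.name.localized_value}: "
              f"{usage.current_value}/{usage.limit} vCPUs - consider an increase")
```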

Considerations When Communication Isn't Feasible

Since you're unable to rely on manual demand forecasts from domains:

  • Focus on technical controls: automation, policy-based restrictions, and predictive analytics.

  • Use role-based access controls and compute policies to enforce “fair-share” distribution.

  • Automate notifications to stakeholders when quota pressure is detected, even if you can’t ask for explicit forecasts.
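
For that last point, a minimal notification sketch, assuming a Teams/Slack-style incoming webhook (the URL and message format are placeholders):

```python
# Minimal sketch: push a quota-pressure message to a stakeholder channel
# via an incoming webhook instead of asking domains for forecasts.
import requests

WEBHOOK_URL = "https://example.webhook.office.com/..."  # hypothetical endpoint

def notify(family: str, used: int, limit: int) -> None:
    # Post a short quota-pressure message to the stakeholders' channel
    requests.post(
        WEBHOOK_URL,
        json={"text": f"Quota pressure: {family} at {used}/{limit} vCPUs"},
        timeout=10,
    )

notify("Standard EDSv4 Family vCPUs", 8_200, 10_000)
```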


Governance Layer      | Example Tool/Action                    | Purpose
Automated Alerts      | Azure Monitor                          | React to high utilization
Quota Policies        | Databricks compute policies            | Prevent resource “hogging”
Trend Forecasting     | Custom scripts/dashboards + ticketing  | Predict & preempt quota exhaustion
VM SKU Allow Lists    | Azure Policy / Databricks policy       | Enforce usage of certain VM families
Throttling/Job Queues | Data Factory, custom job wrappers      | Prevent vCPU overcommit automatically

Adding proactive automation around forecasting, throttling, and compute policies—without relying on manual demand input—creates a controlled environment where spikes in demand are caught and managed before causing outages.

noorbasha534
Valued Contributor II

Can we have quotas defined in Azure Databricks compute policies?

mark_ott
Databricks Employee

Yes, Azure Databricks compute policies let you define “quota-like” limits, but only within Databricks, not Azure subscription quotas themselves. You still rely on Azure’s own quota system for vCPU/VM core limits at the subscription level.

What you can limit in compute policies

Within a compute policy, you can enforce several limits that effectively act as quotas for end users:

  • Max compute resources per user: A setting on the policy that caps how many clusters / compute resources a user can create with that policy; if the user exceeds it, new creations fail rather than auto-terminating old clusters.

  • Max DBUs per hour (per compute): You can constrain attributes such as dbus_per_hour or use the policy UI “Max DBUs per hour” to cap the maximum cost/size of clusters or other compute created with that policy.

  • Instance count / size: Policy JSON can restrict node types and limit min/max worker counts, which indirectly limits per-cluster capacity and spend.
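
As a minimal sketch (names and numbers are illustrative), the three limits above can be combined in a single policy created with the Databricks Python SDK:

```python
# Minimal sketch: one policy combining DBU, worker-count, and per-user limits.
import json

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

definition = {
    # "Max DBUs per hour": caps the size/cost of compute built on this policy
    "dbus_per_hour": {"type": "range", "maxValue": 50},
    # Instance count: bound the worker range, and hence per-cluster capacity
    "autoscale.min_workers": {"type": "range", "minValue": 1},
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
}

w.cluster_policies.create(
    name="domain-quota-like-policy",
    definition=json.dumps(definition),
    max_clusters_per_user=3,  # "Max compute resources per user"
)
```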

What you cannot do in compute policies

  • No direct Azure quota management: Compute policies cannot change or define Azure subscription-level quotas (vCPU, VM family limits, etc.); those are still managed through the Azure Portal and support requests.

  • No global spend cap: Policies do not act as a full tenant-wide budget or cost cap; they only constrain compute created under each policy and do not terminate running resources when you tighten limits.

For serverless compute, Azure Databricks also has serverless DBU/hour quotas enforced at the account/region level, which are separate from compute policies and managed via Azure support requests rather than policy JSON.