Data Governance
Join discussions on data governance practices, compliance, and security within the Databricks Community. Exchange strategies and insights to ensure data integrity and regulatory compliance.

Recommendations for Designing Cluster Policies Across Dev/QA/Prod Environments for DE and DA teams

Charansai
New Contributor III

Hi Community,

We are working on implementing Databricks cluster policies across our organization and are seeking advice on best practices to enforce governance, security, and cost control across different environments.

We have two main teams using Databricks across multiple environments:

  1. Data Engineering – Dev / QA / Prod

  2. Data & Analytics – Dev / QA / Prod

Each environment has a separate Databricks workspace. Our goal is to define robust cluster policies that:

  •  Enforce configuration standards (e.g., disallow public IPs, enforce autoscaling, fixed Spark configs)

  • Control costs (e.g., limit max workers/memory in dev/QA)

  • Ensure production stability (e.g., disallow init scripts or spot instances in prod)

  • Allow safe experimentation in dev while keeping strong guardrails

    We're trying to decide:

    1. Should we define one policy per team per environment (e.g., data-engineering, analytics) or have general reusable policies for each environment type?

    2. What are common policy restrictions used in Dev/QA vs. Prod?
      (e.g., disallowing public IPs, enforcing autoscaling, limiting worker sizes, etc.)

    3. Are there any example templates or reusable patterns followed in other large organizations?

    4. Any tips for balancing developer flexibility with platform governance?

    5. How should policies be differentiated between the Data Engineering and Data & Analytics teams across all environments? Example policy code would be appreciated.

    We appreciate any advice, templates, or governance experiences you can share!

    Thanks in advance!

1 REPLY

Vidhi_Khaitan
Databricks Employee

Hi Team,

Here are a few suggestions that can help:

  • Start with environment-based policies: dev, qa, prod. These define the broadest guardrails (security, cost control, stability).

  • Add team-specific variants only if required. For example, the prod cluster policy is shared unless Data Engineering needs a special Spark config.

In short: use one base policy per environment, and define optional team-specific overlays when needed.

Below are a few sample policy templates:

{
  "name": "qa-shared-policy",
  "definition": {
    "spark_version": { "type": "fixed", "value": "<DBR>" },
    "node_type_id": {
      "type": "allowlist",
      "values": ["Standard_D4s_v3"]
    },
    "autoscale.min_workers": { "type": "fixed", "value": 2 },
    "autoscale.max_workers": { "type": "fixed", "value": 6 },
    "enable_elastic_disk": { "type": "fixed", "value": true },
    "init_scripts": { "type": "forbidden" },
    "aws_attributes.availability": { "type": "fixed", "value": "SPOT" },
    "custom_tags.environment": { "type": "fixed", "value": "qa" }
  }
}

{
  "name": "prod-data-engineering",
  "definition": {
    "spark_version": { "type": "fixed", "value": "<DBR>" },
    "node_type_id": {
      "type": "allowlist",
      "values": ["Standard_D8s_v3"]
    },
    "autoscale.min_workers": { "type": "fixed", "value": 2 },
    "autoscale.max_workers": { "type": "fixed", "value": 10 },
    "enable_elastic_disk": { "type": "fixed", "value": true },
    "init_scripts": { "type": "forbidden" },
    "aws_attributes.availability": { "type": "fixed", "value": "ON_DEMAND" },
    "data_security_mode": { "type": "fixed", "value": "USER_ISOLATION" },
    "custom_tags.team": { "type": "fixed", "value": "data-eng" },
    "custom_tags.environment": { "type": "fixed", "value": "prod" }
  }
}

Each environment typically enforces a distinct set of restrictions based on its purpose. In Dev and QA, policies often allow greater flexibility to support experimentation and testing. Spot instances, for instance, are usually allowed in Dev to reduce cost, while in QA they might be optional depending on workload criticality. Public IPs are typically disallowed in all environments to maintain network security. Dev clusters generally enforce small, cost-effective node types with autoscaling enabled and worker limits kept low (e.g., 1–4 workers). Init scripts are usually permitted in Dev for experimentation but are tightly controlled or disabled altogether in QA and disallowed in Prod to ensure production stability.
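A Dev base policy reflecting those looser guardrails might look like the sketch below, in the same style as the templates above (the node type, worker limits, and tag values are illustrative assumptions, not a definitive standard; `init_scripts` is deliberately omitted from the definition so it remains unrestricted in Dev):

```json
{
  "name": "dev-shared-policy",
  "definition": {
    "spark_version": { "type": "unlimited", "defaultValue": "<DBR>" },
    "node_type_id": {
      "type": "allowlist",
      "values": ["Standard_D4s_v3"]
    },
    "autoscale.min_workers": { "type": "fixed", "value": 1 },
    "autoscale.max_workers": { "type": "range", "maxValue": 4, "defaultValue": 2 },
    "enable_elastic_disk": { "type": "fixed", "value": true },
    "aws_attributes.availability": { "type": "fixed", "value": "SPOT" },
    "custom_tags.environment": { "type": "fixed", "value": "dev" }
  }
}
```

Note the use of `range` for `autoscale.max_workers`: developers can still choose a size, but never above 4 workers, while `unlimited` with a `defaultValue` nudges them toward a standard runtime without pinning it.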

In contrast, Prod policies are much more restrictive. Spot instances and user-defined init scripts are usually disabled to ensure reliability and reduce the risk of unexpected behavior. Node types are limited to high-performance, stable instances, and autoscaling is still enabled, but with a higher upper bound to handle larger workloads. Runtime versions are often pinned and reviewed to ensure compatibility and security, and data security modes are enforced (e.g., USER_ISOLATION or TABLE_ACL when using Unity Catalog). Additionally, mandatory tagging (such as team, environment, cost_center) is enforced across all environments to support cost attribution, auditing, and governance.
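That tagging requirement translates directly into policy attributes. A fragment like the one below (the tag keys and cost-center pattern are illustrative assumptions) could be merged into each environment's base definition; `fixed` tags cannot be changed by users, while a `regex` type forces users to supply a value matching the pattern:

```json
{
  "custom_tags.environment": { "type": "fixed", "value": "prod" },
  "custom_tags.team": { "type": "fixed", "value": "data-eng" },
  "custom_tags.cost_center": { "type": "regex", "pattern": "CC-[0-9]{4}" }
}
```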

Please refer to the documentation below as well:
https://docs.databricks.com/aws/en/security
https://docs.databricks.com/aws/en/data-governance/unity-catalog

Hope this helps!