Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
arunwagle
Databricks Employee

Summary

Managing Databricks infrastructure at scale can quickly become a complex, time-consuming effort, especially when deploying hundreds of workspaces with unique configurations across teams and environments.

The challenge? Replicating environments securely, consistently, and efficiently—without the overhead of constantly tweaking Terraform modules or manually managing infrastructure components.

Here’s how teams can scale infrastructure deployments, boost productivity, and ensure governance, all while freeing up time to focus on delivering business value. The implementation builds on the existing Databricks Terraform modules and Security Reference Architecture (SRA) templates underneath.

The Problem: Manual Deployments Are Slowing You Down

Organizations face recurring pain points when scaling Databricks environments on Azure:

  • Deployment Delays: Manual provisioning takes weeks, stalling innovation.
  • Inconsistency: Without standardization, dev, staging, and prod environments drift apart.
  • High Operational Overhead: Manual setups demand time, effort, and Terraform expertise.
  • Security Risks: Sharing credentials or inconsistent compliance practices leads to vulnerabilities.
  • Poor Scalability: Manual processes can’t keep up with dynamic business needs.

Who will benefit from this?

Spinning up a new Databricks workspace shouldn’t feel like a fire drill—but for many teams, it still does. Manual setups, inconsistent configs, and security concerns slow down delivery and drain resources before any data pipeline even runs.

Infrastructure teams often spend hours writing custom scripts to get customers started, an effort that doesn't scale.

Developers are stuck waiting on infrastructure, losing valuable time they could spend building.

DevOps teams carry the burden of enforcing standards and security, often with manual, error-prone processes.

Security teams worry about shared credentials and unclear separation of responsibilities.

The result? Frustration across the board—and a slower path to value.

The approach described here addresses these pain points, enabling faster innovation, greater productivity, and improved collaboration across teams.

The Solution: A Scalable, Configuration-Driven Approach

To address these challenges, organizations must shift to a configuration-driven infrastructure model that removes complexity while enabling rapid, repeatable deployments.

Flow Diagram

Figure: Currently implemented architecture

Here’s what that looks like in action:

1. No/Low-Code Deployment

Use configuration files, not custom Terraform scripts, for repeatable workspace creation, including support for public and Private Link architectures on Azure.

Example config: the configuration below creates a public sandbox workspace and two simplified Private Link Databricks workspaces (one standard, one for browser authentication).

{
 "config": [
   {
     "name": "aira_adb_sandbox_workspace",
     "create_resource_group": true,
     "region": "eastus",
     "rg_name": "aira-adb-sandbox-rg",
     "tags": {
       "environment": "sandbox",
       "department": "AIRA",
       "created-by": "arun.wagle@databricks.com",
       "project-use-case": "Databricks for POC work",
       "create-date": "20250227"
     },
     "type": "az_adb_public"
   },
   {
     "name": "aira_adb_pl_simplified_workspace",
     "is_auth_workspace": false,
     "create_private_dns_zone": true,
     "vnet_name": "aira-adb-eastus-mldevstage-vnet",
     "network_rg_name": "aira-adb-mldevstage-rg",
     "sg_name": "aira_eastus_databricks_mldevstage_internal_nsg",
     "transit_public_subnet_name": "databricks_aira_mldevstage_external",
     "transit_private_subnet_name": "databricks_aira_mldevstage_internal",
     "transit_pl_subnet_name": "databricks_aira_mldevstage_privatelink",
     "rg_name": "aira-adb-mldevstage-rg",
     "private_endpoint_sub_resource_name": "databricks_ui_api",
     "region": "eastus",
     "tags": {
       "environment": "dev",
       "department": "AIRA",
       "created-by": "arun.wagle@databricks.com",
       "project-use-case": "Databricks for MLDevStage",
       "create-date": "20250227"
     },
     "type": "az_adb_pl_simplified"
   },
   {
     "name": "aira_adb_web_auth_DND_workspace_eastus",
     "is_auth_workspace": true,     
     "vnet_name": "aira-adb-eastus-mldevstage-vnet",
     "network_rg_name": "aira-adb-mldevstage-rg",
     "sg_name": "aira_eastus_databricks_mldevstage_internal_nsg",
     "transit_public_subnet_name": "databricks_aira_mldevstage_external",
     "transit_private_subnet_name": "databricks_aira_mldevstage_internal",
     "transit_pl_subnet_name": "databricks_aira_mldevstage_privatelink",
     "rg_name": "aira-adb-mldevstage-rg",
     "private_endpoint_sub_resource_name": "browser_authentication",
     "region": "eastus",
     "tags": {
       "environment": "dev",
       "department": "AIRA",
       "created-by": "arun.wagle@databricks.com",
       "project-use-case": "Webauth Databricks Workspace for MLDevStage",
       "create-date": "20250227"
     },
     "type": "az_adb_pl_simplified"
   }
 ]
}
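Under the hood, a workspace module can fan out over entries like these using Terraform's jsondecode and for_each. The sketch below is illustrative only; the file layout, variable name, and module path are assumptions, not the actual module code:

```hcl
# Illustrative sketch: read the JSON config and create one workspace per entry.
variable "config_file_name" {
  type = string
}

locals {
  workspace_configs = jsondecode(file("${path.module}/configs/${var.config_file_name}")).config
}

module "adb_workspace" {
  source   = "./modules/az_adb_workspace" # hypothetical module path
  for_each = { for cfg in local.workspace_configs : cfg.name => cfg }

  region  = each.value.region
  rg_name = each.value.rg_name
  tags    = each.value.tags
}
```

Each config entry becomes one module instance keyed by its `name`, so adding a workspace is a JSON edit rather than a Terraform change.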

 

2. CI/CD Integration

Integrate with Azure Pipelines and Azure Repos to automate infrastructure updates, manage state, and ensure consistency across environments. This can be extended to other repositories, such as GitHub or Bitbucket, and to other CI/CD solutions, such as Jenkins or GitHub Actions. The CI/CD pipeline itself is configuration-driven.

Figure: Terraform infrastructure workflow

Example config: the CI/CD configuration below lets users specify the input platform, repo, and Key Vault, and which optional components to create.

{
  "input_cloud": "azure",
  "input_cicd_platform": "azure_devops",
  "input_repo": "azure_repos",
  "az_kv_name": "kv-terraform-vault",

  "create_az_rg": "No",
  "resource_cfg_nm_az_rg": "cfg-rg-1.json",
  "module_nm_az_rg": "setup-az-create-rg",
  "operation_az_rg": "apply",

  "create_az_vnets": "No",
  "resource_cfg_nm_az_vnets": "cfg-vnets-1.json",
  "module_nm_az_vnets": "setup-az-create-vnets",
  "operation_az_vnets": "apply",

  "create_az_subnets": "No",
  "resource_cfg_nm_az_subnets": "cfg-subnets-1.json",
  "module_nm_az_subnets": "setup-az-create-subnets",
  "operation_az_subnets": "apply",

  "create_adb_workspaces": "Yes",
  "resource_cfg_nm_adb_workspaces": "cfg-all-workspaces-1.json",
  "module_nm_adb_workspaces": "setup-db-all-workspaces",
  "operation_adb_workspaces": "apply"
}
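A pipeline step can then gate each module on its "Yes"/"No" flag before invoking Terraform. The snippet below is a minimal sketch, assuming a flat JSON config with one key per line; the `get_flag` helper and the inline sample file are illustrative, not part of the actual pipeline:

```shell
# Illustrative sample config (mirrors the keys in the CI/CD config above).
cat > cicd-config.json <<'EOF'
{
  "create_adb_workspaces": "Yes",
  "module_nm_adb_workspaces": "setup-db-all-workspaces",
  "operation_adb_workspaces": "apply"
}
EOF

# Hypothetical helper: extract the quoted value for a top-level key
# from a simple one-key-per-line JSON file.
get_flag() {
    sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$1" | head -n 1
}

# Only run the workspace module when its create flag is "Yes".
if [ "$(get_flag cicd-config.json create_adb_workspaces)" = "Yes" ]; then
    echo "Running $(get_flag cicd-config.json module_nm_adb_workspaces) ($(get_flag cicd-config.json operation_adb_workspaces))"
fi
```

A real pipeline would typically use `jq` or the CI platform's native parameter expressions instead of `sed`, but the gating logic is the same.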

 

3. Modular Infrastructure

Deploy resources flexibly and at scale, whether all at once or component by component. Provision VNETs, subnets, resource groups, and Databricks workspaces seamlessly.

The sample project structure below takes a modular approach, allowing you to create all resources at once or build them out component by component.

Figure: Sample project structure

4. Automated State Management

Manage Terraform state using Azure Storage and Terraform workspaces, reducing manual intervention and the risk of misconfiguration. You can also leverage other solutions, like Terraform Cloud, for Terraform state management.
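With Azure Storage as the backend, each module's backend block can stay empty and receive its settings at init time; a minimal sketch (the real modules may declare this differently):

```hcl
terraform {
  backend "azurerm" {
    # storage_account_name, container_name, resource_group_name, and key
    # are supplied at init time via -backend-config flags from the CI/CD script.
  }
}
```

Keeping the backend block empty lets the same module code target different state storage per environment without edits.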

Example script (.sh file): these helper functions are invoked from the CI/CD process.

init_terraform() {
   # Check required parameters
   if [[ -z "$MODULE_NAME" || -z "$RESOURCE_CONFIG_FILE_NAME" || -z "$TERRAFORM_STATE_FILE_NAME" ]]; then
       echo "Error: Missing required arguments. Usage: init_terraform <module_folder> <config_file> <state_file>"
       exit 1
   fi

   # Change to the Terraform module directory
   cd "targets/$MODULE_NAME" || { echo "Error: Directory $MODULE_NAME not found!"; exit 1; }

   # Terraform init (non-interactive) against the Azure Storage backend
   echo "Initializing Terraform..."
   terraform init -input=false \
       -backend-config="storage_account_name=$TERRAFORM_STORAGE_ACCT_NAME" \
       -backend-config="container_name=$TERRAFORM_CONTAINER_NAME" \
       -backend-config="resource_group_name=$TERRAFORM_STATE_RESOURCE_GROUP" \
       -backend-config="key=$TERRAFORM_STATE_FILE_NAME" || { echo "Error: Terraform initialization failed."; exit 1; }
}


run_terraform() {
   # Check required parameters
   if [[ -z "$MODULE_NAME" || -z "$RESOURCE_CONFIG_FILE_NAME" || -z "$TERRAFORM_STATE_FILE_NAME" ]]; then
       echo "Error: Missing required arguments. Usage: run_terraform <module_folder> <config_file> <state_file>"
       exit 1
   fi

   # Change to the Terraform module directory
   cd "targets/$MODULE_NAME" || { echo "Error: Directory $MODULE_NAME not found!"; exit 1; }

   # Derive the workspace name from the config file name (e.g. cfg-rg-1.json -> cfg-rg-1)
   local workspace_name="${RESOURCE_CONFIG_FILE_NAME%.json}"

   echo "Selecting Terraform workspace: $workspace_name"
   terraform workspace select -or-create "$workspace_name" || { echo "Error: Failed to select workspace $workspace_name"; exit 1; }

   # Choose the Terraform operation (controlled via env vars)
   echo "TERRAFORM_OPERATION: $TERRAFORM_OPERATION"
   case "$TERRAFORM_OPERATION" in
       plan)    terraform_command="terraform plan -input=false" ;;
       apply)   terraform_command="terraform apply -input=false -auto-approve" ;;
       destroy) terraform_command="terraform destroy -input=false -auto-approve" ;;
       *) echo "Invalid Terraform operation: $TERRAFORM_OPERATION"; exit 1 ;;
   esac

   # Execute the chosen operation with the config file and service principal credentials
   echo "Executing: $terraform_command"
   eval "$terraform_command" \
       -var="config_file_name=$RESOURCE_CONFIG_FILE_NAME" \
       -var="client_id=$TERRAFORM_SP" \
       -var="client_secret=$TERRAFORM_SP_SECRET" \
       -var="tenant_id=$AZURE_TENANT" \
       -var="subscription_id=$AZURE_SUBSCRIPTION"
}

5. Security Built In

Implement Azure Key Vault for secrets management and ensure separation of concerns—DevOps teams handle credentials without exposing them to developers.
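In Azure Pipelines, for example, secrets can be pulled from Key Vault at runtime rather than stored in the repo. A sketch, in which the service connection name and secret names are placeholders:

```yaml
steps:
  - task: AzureKeyVault@2
    inputs:
      azureSubscription: 'terraform-service-connection' # placeholder service connection
      KeyVaultName: 'kv-terraform-vault'                # matches az_kv_name in the CI/CD config
      SecretsFilter: 'terraform-sp-client-id,terraform-sp-secret'
```

The fetched secrets become pipeline variables visible only to the pipeline run, so developers never handle the service principal credentials directly.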

6. Monitoring and Compliance

Deploy tools like the Databricks Security Analysis Tool (SAT) to monitor environments and ensure governance and compliance with security standards.

What is the Security Analysis Tool (SAT)? 

The Security Analysis Tool (SAT) is a Databricks industry solution that analyzes customers' Databricks account and workspace security configurations and provides recommendations that help them follow Databricks security best practices. When customers run SAT, it compares their workspace configurations against a set of security best practices and delivers a report for their Databricks workspaces on AWS, Azure, and GCP. These checks identify recommendations to harden Databricks configurations, services, and resources.


 

Figure: Referenced from blog-announcing-security-analysis-tool-sat 

Once deployed, the SAT Dashboard displays security scan results for each workspace, sorted by severity.


Figure: Referenced from blog-announcing-security-analysis-tool-sat 

The Impact: Why This Matters

  1. 99% Faster Deployments: Reduce setup from 1–2 weeks to 2–4 hours.
  2. Lower Costs: Automate manual processes and cut down on Terraform dependencies.
  3. Increased Productivity: Free engineers to focus on data pipelines, not infrastructure.
  4. Improved Security & Compliance: Consistent enforcement of CI/CD policies and state management.
  5. Scalability Without Headaches: Spin up governed, secure workspaces at scale with ease.

Take Action: Transform How You Manage Databricks Infrastructure

Scaling Databricks environments doesn’t have to mean more complexity. With a configuration-driven, automated approach, your teams can move faster, stay secure, and scale efficiently.

Next steps:

  1. Evaluate your current deployment process and identify bottlenecks.
  2. Explore CI/CD integration using Azure Pipelines or GitHub Actions.
  3. Modularize infrastructure components for flexibility and scalability.
  4. Implement secrets management using Azure Key Vault.
  5. Introduce monitoring tools to ensure governance and compliance.

Start today—simplify your infrastructure, accelerate your deployments, and empower your teams to focus on what matters most: delivering data-driven value.

Need help getting started or want to explore implementation templates? Let’s connect.
