3 weeks ago
Hi everyone,
We are exploring the notebooks-first development approach with Databricks Bundles, and we've run into a workspace-permissions challenge involving Service Principals.
Our notebooks live under a personal path such as /Workspace/Users/<user_email>/project/notebook.
A Service Principal cannot access user workspace paths such as:
/Workspace/Users/<user_email>/...
So the SP has no way to read or execute the notebook, and therefore cannot run the job.
How should we structure our workspace, Git folders, or permissions so the Service Principal can run Bundle-based jobs, without granting SP access to personal user directories?
3 weeks ago - last edited 3 weeks ago
Hi @DineshOjha,
This is a good question, and researching this helped me learn some best practices along the way. What you're seeing is actually expected behaviour: service principals aren't meant to execute notebooks directly from users' personal workspace paths. That limitation is by design, for security and isolation reasons.
Given you're using Databricks Bundles and a notebooks-first workflow, the recommended pattern is to treat Git as the source of truth. Developers can work on notebooks under their own /Workspace/Users/... paths (or locally) for convenience, then sync them to Git (via Git folders / Repos). Those copies in personal home directories should be considered development artefacts only, not what production jobs execute. In production, jobs should use notebooks deployed from Git into a shared workspace path, or reference Git directly (using jobs with a Git-based notebook source).
Instead of pointing jobs to /Workspace/Users/..., configure your bundle target so that it deploys notebooks into a shared folder where the service principal has at least read/execute access and your team can still inspect the deployed artefacts.
For example, in your bundle:
```yaml
targets:
  prod:
    workspace:
      host: https://<your-workspace-url>
      root_path: /Workspace/Shared/projects/my-project
```
When you run `databricks bundle deploy` (ideally from CI/CD, authenticated as the service principal), the notebooks defined in the bundle are materialised under /Workspace/Shared/projects/my-project/... Your bundle's jobs should reference those deployed notebook paths, not the originals under /Workspace/Users/....
On the Databricks side, you'll typically want the service principal to have at least read (and execute) access on the deployed folder, and to be the owner or run-as identity of the jobs. With this setup, developers continue to use their personal workspace areas for development, Git remains the source of truth, and the service principal only interacts with the shared, deployed artifacts; it never needs access to /Workspace/Users/....
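As a hedged illustration, that kind of permissions setup can be declared in the bundle itself; the group name and application ID below are placeholders, not values from this thread:

```yaml
# Illustrative bundle-level permissions sketch; IDs and group names are placeholders.
# Bundles apply this mapping to the deployed resources (and deployment folder).
permissions:
  - service_principal_name: "<sp-application-id>"   # run identity; needs to read deployed files
    level: CAN_MANAGE
  - group_name: "<dev-team-group>"                  # humans who inspect deployed artefacts
    level: CAN_VIEW
```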
If you prefer to be fully Git-centric, you can also configure jobs to pull notebooks directly from Git (e.g. via Repos/git_source) and grant the service principal access to the Git repo, plus job permissions as above. However, the core principle is the same in both approaches: don't run production jobs against notebooks in /Workspace/Users/.... Use Git as the source of truth, and deploy or reference notebooks from a shared, service-principal-readable location.
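For illustration, a Git-backed job in a bundle might look roughly like this; the org/repo URL, job name, and notebook path are placeholders, not from this thread:

```yaml
resources:
  jobs:
    my_git_job:                        # hypothetical job name
      name: my-git-job
      git_source:
        git_url: https://dev.azure.com/<org>/<project>/_git/<repo>
        git_provider: azureDevOpsServices
        git_branch: main
      tasks:
        - task_key: run_notebook
          notebook_task:
            notebook_path: notebooks/etl   # path relative to the repo root
            source: GIT                    # pull from Git at run time, not the workspace
```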
Hope that helps clarify the pattern.
Please let me know if any of the above is unclear.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
3 weeks ago
Thank you so much for your response.
We don't want to keep the notebooks under Shared or run our jobs pointing to the Shared location. We have more than 200 applications and different teams working on them. Each application has a service principal associated with it, and only that service principal has access to the specific application's volume and schema.
Based on your response, we are planning to follow the below approach.
1. Create notebooks under personal user account
2. Push the code to GIT
3. Deploy using bundles
4. In the bundles, provide run_as as the service principal so that the jobs are owned and run using the service principal.
Questions:
1. Do you think this is a good approach for notebook based implementation or do you suggest anything else?
2. The service principal exists only in Databricks, so what email and PAT should be provided to enable GIT access?
3. How will the service principal get access to the Azure GIT repo (ADO repository)?
4. Is there any other access that the service principal needs for this approach, for bundles etc ?
3 weeks ago
Hi @DineshOjha,
Given your constraints (per-application service principals, isolation at the volume/schema level, and not wanting to use /Workspace/Shared), the flow you described aligns with how Bundles are meant to be used in production. Bundles are the recommended CI/CD mechanism, and using service principals as run identities in non-dev targets is explicitly encouraged.
A couple of clarifications and direct answers to your questions:
1. Do you think this is a good approach for notebook based implementation or do you suggest anything else?
Yes, this is a solid pattern for notebook-based implementations: Git serves as the source of truth, with personal workspaces intended solely for development. Bundles deploy notebooks and job definitions into the workspace, and in non-development targets the run_as parameter is configured to use the per-application service principal, so all production runs use that principal's permissions, including access to the appropriate volume/schema. That is critical for maintaining consistency and security throughout the deployment process.
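For reference, a minimal run_as wiring for a non-dev target might look like this; the application ID is a placeholder:

```yaml
targets:
  prod:
    mode: production
    run_as:
      service_principal_name: "<app-sp-application-id>"  # per-application SP
```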
The only design choice you still need is where in the workspace Bundles deploy to. You don't have to use /Workspace/Shared. You can pick any isolated path, for example /Workspace/.bundle/prod/${bundle.name} or /Workspace/Projects/<app_name>/..., and lock that path down so only the application service principal, a small operator group, and optionally CI/CD deployer principals have access. The path naming is up to you. Bundles just need a root_path per target, and you control the permissions there.
So I would keep your 4-step approach and add a per-app workspace root (instead of /Shared), with ACLs granting access only to the relevant SP and operators.
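A per-app workspace root could be sketched in the target config like this; the path convention and <app_name> placeholder are illustrative, not prescribed:

```yaml
targets:
  prod:
    workspace:
      # One isolated root per application; lock this folder down with ACLs.
      root_path: /Workspace/Projects/<app_name>/.bundle/${bundle.name}/${bundle.target}
```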
2. The service principal exists only in Databricks, so what email and PAT should be provided to enable GIT access?
With the Bundles-from-Azure-DevOps pattern you described, the important nuance is that your Databricks service principal does not need to talk directly to Git to make this work. In a typical Azure DevOps setup, Azure DevOps pipelines clone the Git repo themselves using the identity configured in DevOps (service connection, PAT, or Microsoft Entra-backed principal). Once the code is on the build agent, the pipeline calls `databricks bundle validate/deploy/run` using the Databricks service principal to authenticate to Databricks, not to Git.
In that model, you do not need to configure a Git email/PAT on the Databricks SP at all. Git credentials live entirely in Azure DevOps (for checking out the repo). The Databricks SP is only used for workspace authentication (via OAuth M2M, workload identity federation, or an ARM service connection). You only need Git credentials on the Databricks SP if you also want it to use Git folders / Repos in the workspace, or run Git-backed jobs directly from Databricks (using Git-with-jobs / Git folders).
In that case, the email/PAT would belong to a non-human Azure DevOps identity (service principal or technical user) that has access to the repo. You then link those Git credentials to the Databricks SP via the Git integration tab in the workspace.
3. How will service principal get access to the Azure GIT repo (ADO repository)?
In a two-layer setup, the first layer involves Azure DevOps and a Git repository. In this configuration, you create a service principal or technical user with at least Basic access and repository permissions in Azure DevOps. This identity is utilised for your pipelines to check out the code, and it is managed within Azure DevOps, not in Databricks.
The second layer connects Azure DevOps to Databricks through a Databricks service principal. To set this up, you configure an Azure DevOps service connection that authenticates to Databricks using methods such as OAuth M2M, Azure Resource Manager connection, or the recommended workload identity federation (which avoids long-lived secrets). Your pipeline steps will involve commands like `databricks bundle validate -t prod`, `databricks bundle deploy -t prod`, and `databricks bundle run -t prod <job_name>`, with the Databricks CLI already authenticated as the service principal.
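As an illustrative sketch of the pipeline half of this (the variable names and variable values are assumptions, not from this thread), using OAuth M2M environment variables for the Databricks CLI:

```yaml
# azure-pipelines.yml fragment; the Databricks CLI authenticates as the SP via OAuth M2M.
steps:
  - checkout: self                 # DevOps identity clones the repo; no Databricks Git creds involved
  - script: |
      databricks bundle validate -t prod
      databricks bundle deploy -t prod
      databricks bundle run -t prod <job_name>
    displayName: Deploy and run bundle
    env:
      DATABRICKS_HOST: https://<your-workspace-url>
      DATABRICKS_CLIENT_ID: $(databricks-sp-client-id)         # assumed pipeline variable names
      DATABRICKS_CLIENT_SECRET: $(databricks-sp-client-secret)
```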
For a Bundles-only flow, the Databricks service principal does not require direct Git access; it is only used to authenticate the CLI/API calls from the pipeline. However, if you want the Databricks service principal to operate on Git folders within the workspace, you must grant the DevOps identity access to the repository (Basic + repo permissions) and link its Git credentials to your Databricks service principal under the settings for Git integration with Azure DevOps (using PAT or Entra-based authentication).
4. Is there any other access that the service principal needs for this approach, for bundles etc ?
For the exact permission model and how to wire this up, the official docs cover it in more detail:
run_as configuration (how to set the SP as the run identity in targets)
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
2 weeks ago
Thank you so much Ashwin, this provides a lot of clarity.
1. Where to deploy Bundles in the workspace
We plan to deploy the bundle using a service principal, so we plan to deploy the bundle under /Workspace/<service_principal>.
1. Create notebooks under personal user account
2. Create jobs as .yml files to call these notebooks
3. Push the code to GIT
4. Create bundles
5. Deploy the bundle using azure pipelines using the service principal
This would deploy the bundle under the service_principal account and make it the owner of the jobs as well.
These jobs would later be executed via a separate scheduling tool called Control-M.
2. Source as Azure GIT repo vs Workspace
From your response we understand that the service principal needs access to GIT if the source type of our jobs is GIT. But if we define jobs with source: WORKSPACE, the service principal need not have access to GIT.
As these are 2 separate approaches (1. source type as GIT and 2. source as Workspace), is there a benefit of one approach over the other?
3. CI/CD using DAB
We are currently using the python wheel approach, in which we run the pytests as part of the Azure pipeline.
When we are using DAB, what's the best process to run these pytests?
In some places it's mentioned that these tests need to be run as a separate job. I didn't find a place that defines the best practices for these pytests when deploying notebooks using DAB.
4. Notebooks vs python tasks
If we are deploying purely a Python script, is there a recommendation for using one over the other?
In a python wheel approach, we define an entry point, but we don't see an option to do that with notebooks, hence the need to call the main function explicitly. Is that the correct approach?
5. Also, for some reason the links that you provided are not opening correctly; not sure if something got changed while pasting them.
Thank you again for your support, highly appreciate you taking the time to research and respond.
Thanks
Komal
2 weeks ago
Hi @DineshOjha,
Updated links below. Will respond to your queries before the end of this week.
run_as configuration (how to set the SP as the run identity in targets)
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Saturday
Hi @DineshOjha,
I lost track of this... just remembered. Please see the responses below.
1. Where to deploy Bundles in the workspace
Your proposed flow is perfectly compatible with Bundles and CI/CD best practices. On the workspace location: technically, you can set the target's root_path to something like /Workspace/<service_principal>/<app_name> and deploy there, as long as the deploying identity (CI/CD SP) has permission to write into that path and humans who need to debug (e.g., the app team) have at least read access.
The Bundles docs commonly show a pattern like /Workspace/.bundle/${bundle.target}/${bundle.name} (or a similar structured path), which you then secure. So structurally, you have two good options: a per-SP home with per-app subfolders (/Workspace/<service_principal>/<app_name>), or a neutral "system" root for all bundles (/Workspace/.bundle/prod/${bundle.name}).
From Databricks' perspective, both are fine as long as the ACLs are correct. For a large estate (200+ apps), the neutral .bundle namespace tends to age better for discoverability and governance, but your per-SP approach is not wrong; it's more of an org-convention choice.
Running the jobs from Control-M is also fine. You're just triggering Databricks jobs via API, and the location of deployed assets doesn't change that.
2. Source as Azure GIT repo vs Workspace
Your understanding is correct. If a job is configured with source = Git (Git-with-jobs, or Git folders), then the Databricks identity that pulls from Git (user or SP) needs Git credentials/permissions. If a job is configured with source = Workspace (tasks point at workspace notebook paths), and Azure DevOps does the git checkout and then calls `databricks bundle deploy`, then the Databricks service principal does not need Git access: DevOps talks to Git, and the SP only talks to Databricks.
Bundles already assume Git is your source of truth and handle deployment from the checkedโout repo into the workspace. In that model, itโs very common to use WORKSPACE as the job source (tasks reference the deployed notebook/script paths), and let Bundles + CI/CD ensure that workspace state is in sync with Git.
With the workspace source (with Bundles) approach, the simpler mental model is Git → CI/CD → Bundles → workspace; jobs read from the workspace. You also get the full power of Bundles: targets, run_as, permissions, deployment modes, etc. And there is no need to manage Git credentials on the Databricks SP unless you also use Git folders directly.
The Git source (Git-with-jobs) approach is most useful if you aren't using Bundles and want jobs to pull from Git directly at run time. It supports more limited job/task types, and the job configuration itself isn't in source control in the same way as with Bundles.
Given you are standardising on Bundles and already using Azure DevOps, you may want to consider workspace source for jobs (deployed by Bundles), and keep Git access concentrated in Azure DevOps and any interactive developer identities.
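For illustration, a workspace-source task in a bundle might look like this; the job name and file layout are assumed, and Bundles rewrite the relative path to the deployed workspace path:

```yaml
resources:
  jobs:
    app_job:                          # hypothetical job
      name: app-job
      tasks:
        - task_key: main
          notebook_task:
            # Relative to this YAML file; resolves under the target's root_path after deploy.
            notebook_path: ../src/etl_notebook.py
```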
3. CI/CD using DAB
The core best practice doesnโt change with Bundles. Keep unit tests (pytest) in your CI system, close to the code. This is still the primary mechanism for fast feedback and correctness, regardless of Bundles.
What Bundles add is a good place for integration tests where you define a test/run-unit-tests job as a resource inside the bundle (for example a small job that runs a test notebook or a script calling your wheel).
The official Azure DevOps + Bundles example shows this pattern: build/test artifact, then deploy bundle, then run a test job from the bundle. So: keep pytests in Azure Pipelines as you do today, and optionally add bundle-defined test jobs for integration/end-to-end checks.
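A hedged sketch of such a bundle-defined test job; the job name and file path are placeholders, and the wrapper script is an assumption:

```yaml
resources:
  jobs:
    run_integration_tests:            # hypothetical test job defined inside the bundle
      name: run-integration-tests
      tasks:
        - task_key: pytest
          spark_python_task:
            # Thin wrapper script that invokes pytest against the deployed code.
            python_file: ../tests/run_integration_tests.py
```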
4. Notebooks vs python tasks
A common pattern that fits what you're doing today: keep all real logic in a wheel (or at least a Python package). In jobs, either run a python_wheel_task directly (no notebook at all), or use a very thin notebook that imports your wheel and calls main() with parameters.
That gives you the best of both worlds. Testability and CI friendliness from the wheel, plus optional notebook ergonomics when you want them.
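As an illustrative sketch of the wheel-first variant (package name, entry point, and wheel path are assumptions):

```yaml
resources:
  jobs:
    wheel_job:                        # hypothetical job running the wheel directly
      name: wheel-job
      tasks:
        - task_key: run_wheel
          python_wheel_task:
            package_name: my_app      # name declared in the wheel's setup/pyproject
            entry_point: main         # console-script entry point in the wheel
          libraries:
            - whl: ../dist/*.whl      # wheel built by the bundle's artifacts section
```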
Hope this helps.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.