Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
dkushari
Databricks Employee

Authors: Abhishek Pratap (@aps) & Dipankar Kushari (@dkushari)

In this blog, we explore how to synchronize nested groups in Databricks from your organization’s identity provider - Azure Active Directory.

How to Sync nested Azure AD groups to Databricks

System for Cross-domain Identity Management, or SCIM, is an open standard that allows you to automate user provisioning in Databricks. SCIM lets you use an identity provider (IdP) to create users in Databricks, give them the proper level of access, and remove access (deprovision them) when they leave your organization or no longer need access to Databricks. You can use a SCIM provisioning connector in your IdP or invoke the Identity and Access Management SCIM APIs to manage provisioning. You can also use these APIs to manage identities in Databricks directly, without an IdP. 
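As a rough illustration of what a SCIM provisioning call carries, the sketch below builds the JSON body of a SCIM 2.0 User resource. The schema URN and the `userName`, `displayName`, and `active` attributes come from the SCIM 2.0 core schema; the helper name `build_scim_user` and the sample values are hypothetical, and the exact set of attributes Databricks accepts is not covered in this post.

```python
import json

# Schema URN defined by the SCIM 2.0 core schema for User resources.
SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"


def build_scim_user(user_name: str, display_name: str, active: bool = True) -> dict:
    """Return a SCIM 2.0 User resource body (hypothetical helper for illustration)."""
    return {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": user_name,
        "displayName": display_name,
        # Setting active to False is the usual SCIM way to deprovision a user.
        "active": active,
    }


payload = build_scim_user("jane.doe@example.com", "Jane Doe")
print(json.dumps(payload, indent=2))
```

An IdP connector or a custom script would send a body like this to the SCIM Users endpoint, authenticated with a SCIM token.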

Azure Active Directory (Azure AD) is a cloud-based identity and access management service that enables your employees’ access and single sign-on to external resources, such as Microsoft 365, the Azure portal, and applications such as Databricks. You can set up provisioning to Databricks using Azure Active Directory (Azure AD) at the Databricks account level or at the Databricks workspace level.

Single sign-on (SSO), on the other hand, enables you to authenticate your users using your organization’s identity provider. If your identity provider supports the SAML 2.0 protocol (or, in the case of account-level SSO, the OIDC protocol), you can use Databricks SSO to integrate with your identity provider. SSO makes it easy to centrally manage access to Databricks resources and business applications instead of having to sign in to Databricks using separate user credentials. With SSO enabled, users can access Databricks with their corporate credentials. This delivers a better user experience without the need to manage separate sets of credentials.

How Azure AD is set up in organizations

A key task for a Databricks administrator configuring access is to integrate and sync your corporate Users and Groups from Azure AD into your Databricks account.

The user and group mapping is usually carefully designed and reflects a complex organizational structure. You will often have nested groups: for example, a parent group representing a department may contain child groups representing its sub-departments. These Users and Groups need to be brought into a Databricks account with their hierarchical relationships intact.

Keeping the Databricks group structure the same as your Azure AD group structure has several advantages:

  1. Maintaining the same user and group structure as in Azure AD keeps the data security implementation consistent with the rest of the organization.
  2. Organization-wide audit controls work seamlessly when the group and user implementation is the same across all toolsets and data platforms.
  3. For many organizations, this is a must-have for security compliance.

dkushari_0-1694092943137.png

                           FIG 1: Sample Azure AD Group Structure in large organizations

Challenges

However, provisioning nested Azure AD groups in a Databricks account poses a few challenges.

  • The Microsoft-provided enterprise application does not support automatic provisioning of nested groups to any Azure AD Enterprise app.
  • Azure Active Directory can only read and provision users that are immediate members of the explicitly assigned group.

These limitations prevent an organization from syncing multiple levels of groups into Databricks unless it restructures its Users and Groups for Databricks. Moreover, nested groups can have variable depth, so the solution must traverse them recursively, syncing a parent group along with its direct members and all of their descendants.
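The recursive traversal described above can be sketched with a small in-memory example. The `DIRECTORY` mapping and the `collect_nested` function are hypothetical stand-ins for real Azure AD lookups, not the connector's actual code; the sketch only shows how a variable-depth hierarchy can be walked while guarding against groups that reference each other.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical directory: group name -> direct members.
# Each member is a (kind, name) pair where kind is "user" or "group".
DIRECTORY: Dict[str, List[Tuple[str, str]]] = {
    "engineering": [("user", "alice"), ("group", "platform"), ("group", "data")],
    "platform": [("user", "bob")],
    "data": [("user", "carol"), ("group", "ml")],
    "ml": [("user", "dave"), ("group", "data")],  # cycle back to "data"
}


def collect_nested(
    top_group: str, directory: Dict[str, List[Tuple[str, str]]]
) -> Tuple[Set[str], Set[str]]:
    """Return every group and user reachable from top_group, at any depth."""
    groups: Set[str] = set()
    users: Set[str] = set()

    def visit(group: str) -> None:
        if group in groups:  # already visited: avoids infinite loops on cycles
            return
        groups.add(group)
        for kind, name in directory.get(group, []):
            if kind == "group":
                visit(name)  # recurse into the child group
            else:
                users.add(name)

    visit(top_group)
    return groups, users


groups, users = collect_nested("engineering", DIRECTORY)
print(sorted(groups))
print(sorted(users))
```

Tracking visited groups in a set keeps the traversal safe even when the depth is unknown or the hierarchy contains a loop.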

If your organization has nested Azure AD groups that you want to sync with your Databricks account, this post shows how to sync them with a few lines of Python code and work around the limitations of the core Azure AD sync infrastructure.

Solution

To sync nested groups from your Azure Active Directory to your Databricks account, we have put together a solution described below. This utility allows you to sync Users and Groups, including nested Users and Groups, from Azure AD to Databricks. The code for the solution is available on the GitHub repository.

Note: This is a custom solution (provided as-is) that replaces the Microsoft Enterprise Application.

Before running the steps below, acquire the code and provision the compute needed to run it.

  1. Step 1 - Register an application in Azure and grant Read permissions to the required Users and Groups.
  2. Step 2 - Get Databricks SCIM details and prepare a config file.
  3. Step 3 - Load the above config in the “nested-aad-scim-connector” and run it.

The connector performs the actions shown in the diagram below.

  1. Acquires an access token from Azure AD using its own identity
  2. Calls the Microsoft Graph API to retrieve Users and Groups
  3. Calls the Databricks API to find the existing Users and Groups
  4. Analyzes and finds the Users and Groups that need to be added or removed on Databricks
  5. Adds or removes those Users and Groups

dkushari_1-1694092943191.png

FIG 2: Sync nested groups into Databricks
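The comparison in step 4 can be pictured as a set difference between what Azure AD reports and what Databricks already has. The sketch below is a simplified assumption about that logic, not the connector's actual implementation; `SyncPlan`, `plan_sync`, and the sample identities are hypothetical.

```python
from typing import NamedTuple, Set


class SyncPlan(NamedTuple):
    to_add: Set[str]      # present in Azure AD, missing in Databricks
    to_remove: Set[str]   # present in Databricks, no longer in Azure AD


def plan_sync(source: Set[str], target: Set[str]) -> SyncPlan:
    """Compare source (Azure AD) with target (Databricks) identities."""
    return SyncPlan(to_add=source - target, to_remove=target - source)


aad_users = {"alice@example.com", "bob@example.com", "carol@example.com"}
dbx_users = {"bob@example.com", "erin@example.com"}

plan = plan_sync(aad_users, dbx_users)
print("add:", sorted(plan.to_add))
print("remove:", sorted(plan.to_remove))
```

The same comparison would apply to groups and to group memberships; only the identities being compared change.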

How to configure Azure AD for Databricks

Each configuration step is detailed below.

Step 1 - Register an application in Azure AD with Read.All permissions.

Note - You need to register an application in Azure Active Directory to enable authentication.

Follow these steps to register the application:

  1. Open a browser, navigate to the Azure Active Directory admin center, and log in using a personal account (also known as a Microsoft Account) or a Work or School Account.
  2. Select Azure Active Directory in the left-hand navigation, then select App registrations under Manage.

dkushari_2-1694092943116.png

FIG 3: App registrations under Manage in the Azure Active Directory admin center

  3. Select New registration. Enter a name for your application, for example, CustomAADConnector.
  4. Set Supported account types as desired.
  5. Leave Redirect URI empty.
  6. Select Register. On the application's Overview page, copy the value of the Application (client) ID and save it; you will need it in the next step. If you chose Accounts in this organizational directory only for Supported account types, also copy and save the Directory (tenant) ID.

dkushari_3-1694092943182.png

  7. (Optional; required only when you run the sync from outside your Azure network) Select Authentication under Manage. Locate the Advanced settings section, change the Allow public client flows toggle to Yes, then choose Save.

dkushari_4-1694092943078.png

  8. In the Application menu blade, click Certificates & secrets. In the Client secrets section, choose New client secret:
    1. Type a key description (for instance, app secret).
    2. Select a key duration that meets your security requirements.
    3. The generated key value is displayed when you click the Add button. Copy the generated value for use in later steps.

Note - You will need this key in your code's configuration files later. The key value is not displayed again and cannot be retrieved by any other means, so save it from the Azure portal before navigating to any other screen or blade.

  9. In the Application menu blade, click API permissions on the left to open the page where you add access to the APIs that your application needs.
    1. Click the Add a permission button.
    2. Ensure that the Microsoft APIs tab is selected.
    3. In the Commonly used Microsoft APIs section, click Microsoft Graph.
    4. In the Application permissions section, ensure that the right permissions are checked: User.Read.All and Group.Read.All (a reader comment on this post reports that the sync fails with an insufficient-privileges error when Group.Read.All is missing).
    5. Select the Add permissions button at the bottom.

  10. At this stage, the permissions are assigned correctly, but because the client app does not allow users to interact, they cannot consent to these permissions. To get around this, let the tenant administrator consent on behalf of all users in the tenant. Click the Grant admin consent for {tenant} button, and then select Yes when asked whether you want to grant consent for the requested permissions for all accounts in the tenant. You need to be the tenant admin to carry out this operation.
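To show where the client ID, client secret, and tenant ID from this step end up, here is a minimal sketch of an OAuth 2.0 client-credentials token request against the Microsoft identity platform, plus a check for the Graph error shape quoted in the reader comment on this post. The helper names `build_token_request` and `graph_error_code` and the sample values are hypothetical; whether the connector obtains its token exactly this way is an assumption.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Tuple[str, str]:
    """Return (url, form_body) for a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already granted
        # to the app registration (for example via admin consent).
        "scope": "https://graph.microsoft.com/.default",
    })
    return url, body


def graph_error_code(payload: dict) -> Optional[str]:
    """Return the error code from a Graph error payload, or None if there is none."""
    error = payload.get("error")
    if isinstance(error, dict):
        return error.get("code")
    return None


url, body = build_token_request("tenant-123", "app-id", "s3cret")
print(url)

denied = {"error": {"code": "Authorization_RequestDenied",
                    "message": "Insufficient privileges to complete the operation."}}
print(graph_error_code(denied))
```

Checking for an `error` key before using a Graph response makes a missing permission or missing admin consent visible immediately, instead of surfacing later as an unclear failure.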

Step 2 - Get Databricks SCIM details and prepare a config file.

After registering the app, collect the Databricks SCIM details and prepare a config file. The template is here. Populate the following:

  1. clientId of the registered app
  2. clientSecret for the app
  3. Azure tenant ID
  4. Databricks SCIM token and URL as documented here

dkushari_5-1694092943124.png
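Before running the sync, it can help to confirm that every value listed above is actually filled in. The sketch below uses Python's standard `configparser` for that. The section and key names (`azure`, `databricks`, `client_id`, and so on) are hypothetical; the real names are defined by the config.cfg.template file in the repository.

```python
import configparser

# Hypothetical layout; the real key names come from config.cfg.template.
REQUIRED = {
    "azure": ("client_id", "client_secret", "tenant_id"),
    "databricks": ("scim_url", "scim_token"),
}

SAMPLE = """
[azure]
client_id = 11111111-2222-3333-4444-555555555555
client_secret = example-secret
tenant_id = aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee

[databricks]
scim_url = https://example.invalid/scim/v2
scim_token = example-token
"""


def load_settings(text: str) -> dict:
    """Parse config text and fail fast if any required value is missing."""
    # interpolation=None keeps '%' characters in secrets or URLs literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    settings = {}
    for section, keys in REQUIRED.items():
        for key in keys:
            value = parser.get(section, key, fallback="").strip()
            if not value:
                raise ValueError(f"missing config value: [{section}] {key}")
            settings[f"{section}.{key}"] = value
    return settings


settings = load_settings(SAMPLE)
print(sorted(settings))
```

Failing fast on an empty value is cheaper than discovering a missing secret halfway through a sync run.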

Step 3 - Load the above config in the “nested-aad-scim-connector” and run it

You can do this either by extending the Python app or by reusing the PyPI utility.

Detailed code can be found at this GitHub location.

Running the app

You can run the solution as a standalone Python app. Follow the instructions below.

  1. Install the utility via pip:

pip install nestedaaddb

  2. Copy config.cfg.template, populate the details, and rename the file to config.cfg.
  3. Run as below:

from nestedaaddb.nested_groups import SyncNestedGroups

# Create the sync utility and load the populated config file
sn = SyncNestedGroups()

sn.loadConfig("<<Path of config.cfg>>")

# Sync the top-level group and everything nested under it
sn.sync(<<Top level Group>>, <<Is Dry Run>>)

<<Top level Group>>: The top-level group in AAD to sync from.

<<Is Dry Run>>: Whether this is a dry run. A dry run only prints the Users and Groups to be added and does not create them.

Source code: GitHub

Warning: The provided code is offered on an "as-is" basis without any guarantee or warranty. It is strongly recommended to exercise caution and thoroughly test the code in your test environment before using it in a production environment.
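In line with the warning above, one cautious pattern is to always run a dry run first and apply changes only after reviewing its output. The sketch below wraps the two calls shown earlier (`loadConfig` and `sync`) in a hypothetical `run_sync` helper. It assumes the dry-run argument is a boolean, which the post does not state explicitly, and it uses a stand-in `FakeSyncer` class so the example runs without Azure or Databricks credentials.

```python
class FakeSyncer:
    """Stand-in exposing the same two methods the post calls on SyncNestedGroups."""

    def __init__(self):
        self.calls = []

    def loadConfig(self, path):
        self.calls.append(("loadConfig", path))

    def sync(self, top_level_group, is_dry_run):
        self.calls.append(("sync", top_level_group, is_dry_run))


def run_sync(syncer, config_path, top_level_group, apply=False):
    """Always preview with a dry run; make real changes only when apply=True."""
    syncer.loadConfig(config_path)
    syncer.sync(top_level_group, True)       # dry run: report only
    if apply:
        syncer.sync(top_level_group, False)  # real run: create Users and Groups


demo = FakeSyncer()
run_sync(demo, "config.cfg", "parent", apply=False)
print(demo.calls)
```

With the real utility, you would pass a `SyncNestedGroups()` instance instead of `FakeSyncer()` and set `apply=True` only after checking the dry-run output in a test environment.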

Nested groups in Databricks

You can view your group and its members (i.e., Users and Groups) in the Groups tab of the account console. An example of a nested group synced from Azure AD is shown below, where the parent group has another group, called child, as its member.

dkushari_6-1694092943169.png

Conclusion

In this blog, we explored how to synchronize nested groups in Databricks from your organization’s identity provider - Azure Active Directory.

Here are some related links for your reference:

Databricks Workspace Administration – Best Practices for Account, Workspace and Metastore Admins

Unity Catalog Onboarding

Manage users, service principals, and groups

Call to Action

Try out this solution today to sync your nested Users and Groups from Azure AD into your Databricks Account. You can refer to this video for step by step guidance on how to sync nested Azure AD groups to Databricks.


1 Comment
StanislavRupec
New Contributor

 

Thank you for this guide. I ran into one small issue with it: you also need to add the Microsoft Graph API permission Group.Read.All and get consent for it; otherwise the script fails at line 58 (group = self.graph.getGroupByName(toplevelgroup)).

The error is not very clear, but when I printed the group (print("Group" + str(group))), it showed an error about insufficient permissions:
{'error': {'code': 'Authorization_RequestDenied', 'message': 'Insufficient privileges  to complete the operation.', 'innerError': {'date': '9999-99-99T00:00:00', 'request-id': '00000000-0000-0000-0000-000000000000', 'client-request-id': '00000000-0000-0000-0000-000000000000'}}}

After adding the correct API permissions, it now works.