jack_zaldivar
Databricks Employee

Now that we have our Users, Groups, and Group Members exported to Databricks, it's time for us to grab the applications that those users and groups are assigned to. Much like the previous walkthroughs, we will focus on how I created my Notebooks to query the Okta APIs for the data I'm interested in.

You may have noticed that I target a specific object type per blog entry. This is actually how I have my Notebooks set up. Each object type gets its own Notebook, so that I can schedule them independently or as part of a multi-task job. I did this for two reasons:

  1. This just happened to be how I created the Notebooks, because I didn't really have a lot of time to think about a design as I was initially learning the platform.
  2. It made things easier (for me) to troubleshoot when something went wrong.

And with that...

Let's begin the operation!

As with our previous Notebook, let's make sure we have the modules/libraries installed for this Notebook and restart our kernel if we had to install any of the modules.

import importlib.util

mods = ['nest_asyncio', 'okta']
restart_required = False
for mod in mods:
  if importlib.util.find_spec(mod) is not None:
    print(f'{mod} already installed')
  else:
    %pip install {mod}
    restart_required = True

# restart the Python kernel so any newly installed modules are importable
if restart_required:
  dbutils.library.restartPython()
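The check above relies on `importlib.util.find_spec`, which returns a module spec when a module is importable and `None` when it isn't. A quick standalone illustration:

```python
import importlib.util

# find_spec returns a ModuleSpec when the module can be imported...
print(importlib.util.find_spec("json") is not None)  # → True
# ...and None when it cannot
print(importlib.util.find_spec("definitely_not_a_real_module") is None)  # → True
```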

Shhh...It's a Secret!

I will continue to include a section in each post reminding you about secret management, but I won't go into any more detail than previously described. Just... please don't put your secrets in the Notebooks themselves. Make sure you're using Databricks Secret Management instead.
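As a reminder, the API token can be pulled from a secret scope at the top of the Notebook. The scope and key names below are placeholders; substitute whatever you created with the Databricks CLI or API (this only runs inside a Databricks notebook):

```python
# Hypothetical scope/key names for illustration; substitute your own.
okta_key = dbutils.secrets.get(scope="okta", key="api-token")
```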

Let's Get Cooking!

Back to our normal cadence, we'll use the Okta SDK again, this time to collect all of the Applications and the Groups assigned to those applications. We prefer to use group assignment for application access, as it's easier to manage. If you don't do this, you may need to change the script to collect individually assigned users instead.

import nest_asyncio
import asyncio
from okta.client import Client as OktaClient

# okta_key should come from Databricks Secret Management (see above),
# never hard-coded in the Notebook
config = {
    'orgUrl': 'https://my-okta-org.okta.com',
    'token': okta_key
}

okta_client = OktaClient(config)

async def get_app_groups(apps):
    app_data = []
    for app in apps:
        #print(f'checking {app}')
        app_groups, resp, err = await okta_client.list_application_group_assignments(app.id)
        while True:
            for group in app_groups:
                app_data.append({"app_id": app.id, "group_id": group.id})
            if resp.has_next():
                app_groups, err = await resp.next()
            else:
                break
    return app_data

async def list_okta_apps():
    app_list = []
    #attempting to filter for only ACTIVE apps, but that didn't work...if anyone knows of a better way to do this, please comment and let me know!
    apps, resp, err = await okta_client.list_applications(query_params={'status': 'ACTIVE'})
    #write the count to the screen so we know there's some progress while the script runs
    print(f'Got {len(apps)} apps in the first page')
    while True:
        for app in apps:
            app_list.append(app)
        if resp.has_next():
            apps, err = await resp.next()
        else:
            break
    return app_list

if __name__ == '__main__':
    nest_asyncio.apply()
    #collect all of the apps
    apps = asyncio.run(list_okta_apps())

    #since the status query filter didn't work, loop through all apps and keep only the ACTIVE ones
    active_apps = []
    for a in apps:
        if a.status == 'ACTIVE':
            active_apps.append(a)

    #get groups assigned to ACTIVE apps        
    app_groups = asyncio.run(get_app_groups(active_apps))
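Both functions above use the same SDK pagination pattern: process the current page, then fetch the next one with `resp.next()` while `resp.has_next()` is true. Here is a minimal synchronous sketch of that loop with a stand-in response object (no Okta calls; all names are illustrative):

```python
# Stand-in for the Okta SDK response object, holding the remaining pages
class FakeResponse:
    def __init__(self, pages):
        self._pages = pages
        self._i = 0

    def has_next(self):
        return self._i < len(self._pages)

    def next(self):
        # mirrors the SDK's (items, error) return shape
        page = self._pages[self._i]
        self._i += 1
        return page, None

def collect_all(first_page, resp):
    items = []
    page = first_page
    while True:
        items.extend(page)          # process the current page
        if resp.has_next():
            page, err = resp.next() # then pull the next one
        else:
            break
    return items

resp = FakeResponse([[3, 4], [5]])
print(collect_all([1, 2], resp))  # → [1, 2, 3, 4, 5]
```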

Add your as_of_date!

As previously mentioned, this step is optional, but I like to add today's date to the data. This way, we can always see a snapshot of what the environment looked like on any given day.

from datetime import date

new_coll = []
today = date.today()

for one in app_groups:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
app_groups = new_coll

Define the Schema for App to Group Mapping

Another simple schema for app groups. We really only need two pieces of information, which will tie the Groups table (from Part 2 of the series) and the Apps table (below) together.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create a SparkSession
spark = SparkSession.builder.appName("OktaAppGroups").getOrCreate()

# Define the schema
schema = StructType([
  StructField("app_id", StringType(), True),
  StructField("group_id", StringType(), True),
  StructField("as_of_date", StringType(), True)
])

df = spark.createDataFrame(app_groups, schema)
df_formatted = df.select("app_id", "group_id", "as_of_date")

Write to the Table

Since there isn't really much transformation needed here, we'll just write to one table. However, to keep naming consistent, we'll still use the _formatted table name.

df_formatted.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_prod_app_groups_formatted", mode="append") 

Two for One

Since we had to get the Application list in order to get the App Groups assigned to those applications, we should take advantage of that! We still have the App list in memory, so let's add our as_of_date to that collection next. While running this next cell, I also ran into some issues where an app may not have a defined accessPolicy. Because of this, I added a little check for each app and created a new key in the dictionary to handle null accessPolicies.

from datetime import date

new_coll = []
today = date.today()

for one in apps:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.__dict__.copy()
    # Add the new key-value pair
    if 'accessPolicy' in one.links:
        # the access policy id is the last path segment of the accessPolicy href
        policy_href = str(one.links['accessPolicy']['href'])
        updated_one['access_policy'] = policy_href.split('/')[-1]
    else:
        updated_one['access_policy'] = ''
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
apps = new_coll
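The access_policy value is just the last path segment of the policy href. With a made-up href for illustration:

```python
# Hypothetical href; real ones come from app.links['accessPolicy']['href']
href = "https://my-okta-org.okta.com/api/v1/policies/rst1abcd2EFGH3ijk4x7"
policy_id = href.rstrip("/").split("/")[-1]
print(policy_id)  # → rst1abcd2EFGH3ijk4x7
```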

Define our App Schema

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, ArrayType

# Create a SparkSession
spark = SparkSession.builder.appName("OktaApps").getOrCreate()

# Define the schema
schema = StructType([
  StructField("id", StringType(), True),
  StructField("name", StringType(), True),
  StructField("label", StringType(), True),
  StructField("status", StringType(), True),
  StructField("last_updated", StringType(), True),
  StructField("created", StringType(), True),
  StructField("accessibility", StructType([
    StructField("self_service", StringType(), True),
    StructField("error_redirect_url", StringType(), True),
    StructField("login_redirect_url", StringType(), True),
  ]), True),
  StructField("visibility", StructType([
    StructField("auto_submit_toolbar", StringType(), True),
    StructField("hide", StructType([
      StructField("i_os", BooleanType(), True),
      StructField("web", BooleanType(), True),
    ])),
  ])),
  StructField("features", ArrayType(StringType()), True),
  StructField("sign_on_mode", StringType(), True),
  StructField("access_policy", StringType(), True),
  StructField("as_of_date", StringType(), True)
])

df_apps = spark.createDataFrame(apps, schema)
df_apps_formatted = df_apps.select("id", "name", "label", "status", "last_updated", "created", "accessibility.*", "visibility.auto_submit_toolbar", "visibility.hide.*", "features", "sign_on_mode", "access_policy", "as_of_date")

Write to the Apps Table

df_apps.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_prod_apps", mode="append") 
df_apps_formatted.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_prod_apps_formatted", mode="append")