jack_zaldivar
Databricks Employee

Welcome back!

In Part 1 of this series, we walked through the process of exporting our Okta Users to Databricks. In Part 2 of the series, we exported our Okta Groups and Group Rules. In this installment, we'll collect our Group Members, so we can start tying these tables together! 

Notebook Setup

Okta, so...admittedly, the SDK is working out OK (aside from the pickling issue in the last Notebook), so I guess I'll keep it going. Let's make sure we have the modules/libraries installed for this Notebook, and restart our kernel if we had to install any of them.

import importlib.util
import sys

# Install any missing modules, then restart Python so they can be imported
mods = ['nest_asyncio', 'okta']
restart_required = False
for mod in mods:
  spec = importlib.util.find_spec(mod)
  if spec is not None:
    print(f'{mod} already installed')
  else:
    %pip install {mod}
    restart_required = True

if restart_required:
  dbutils.library.restartPython()

Don't Forget Your Secret!

I'm not going into detail with the secret management anymore, but don't forget that you'll need to retrieve it and decode it appropriately (see Part 1 for details).
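As a reminder only: your scope and key names will differ, and whether you need the base64 decode depends on how you stored the token in Part 1. Retrieval looks roughly like this:

import base64

# Hypothetical scope/key names; see Part 1 for how the secret was stored
okta_key = base64.b64decode(
    dbutils.secrets.get(scope="my-scope", key="okta-api-token")
).decode("utf-8")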

Get Your Groups and Members!

In order to get all group memberships, we need to first get all the groups, right? I mean, it makes sense to me, at least. So, let's do that step first, then iterate through all of the groups to collect the group memberships.

import nest_asyncio
import asyncio
from okta.client import Client as OktaClient


config = {
    'orgUrl': 'https://my-okta-org.okta.com',
    'token': okta_key
}

okta_client = OktaClient(config)

async def list_okta_groups():
    # Collect every group, following the SDK's pagination
    group_list = []
    groups, resp, err = await okta_client.list_groups()
    while True:
        for group in groups:
            group_list.append(group)
        if resp.has_next():
            groups, err = await resp.next()
        else:
            break
    return group_list

async def get_all_group_memberships(groups):
    # For each group, page through its members and record one row per membership
    group_data = []
    for group in groups:
        print(f'checking {group.profile.name}: {group.id}')
        members, resp, err = await okta_client.list_group_users(groupId=group.id)
        while True:
            for member in members:
                group_data.append({"group_id": group.id, "user_id": member.id, "user_login": member.profile.login})
            if resp.has_next():
                members, err = await resp.next()
            else:
                break
    return group_data


if __name__ == '__main__':
    nest_asyncio.apply()
    groups = asyncio.run(list_okta_groups()) # get all groups
    members = asyncio.run(get_all_group_memberships(groups)) # for each group, let's get the members
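Before moving on, a quick sanity check that both collections came back populated never hurts. These counts are purely for eyeballing the results and aren't required for the pipeline:

# Quick sanity check on what we pulled back
print(f'{len(groups)} groups, {len(members)} membership rows')
print(members[0] if members else 'no memberships found')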

Add your as_of_date!

This step is optional, of course, but I like to add today's date to the data, so we can always see a snapshot of what the environment looked like on any given day.

from datetime import date

new_coll = []
today = date.today()

for one in members:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
members = new_coll
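If you prefer something more compact, the same stamping can be done in one line with a dict comprehension; this is just an equivalent alternative, not a change to the logic:

from datetime import date

# Equivalent one-liner: copy each dict and stamp today's date
members = [{**m, 'as_of_date': str(date.today())} for m in members]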

Define the Schema

This is probably one of the simplest schemas in the series. There really isn't much nested information to extract from these records. In this instance, our bronze and silver layers are basically the same; I think I only kept both tables for consistency. /shrug

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create a SparkSession (returns the existing session on Databricks)
spark = SparkSession.builder.appName("OktaGroupMembers").getOrCreate()

# Define the schema
schema = StructType([
  StructField("group_id", StringType(), True),
  StructField("user_id", StringType(), True),
  StructField("user_login", StringType(), True),
  StructField("as_of_date", StringType(), True)
])

df = spark.createDataFrame(members, schema)
df_formatted = df.select("group_id", "user_id", "user_login", "as_of_date")

Write to the table

As always, our last step is to write our dataframe to a table.

df.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_group_members", mode="append") 

df_formatted.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_group_members_formatted", mode="append") 
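To confirm that today's snapshot actually landed, a quick query against the formatted table (same table name as above) does the trick:

# Sanity check: membership counts per snapshot date
spark.sql("""
  SELECT as_of_date, COUNT(*) AS memberships
  FROM users.jack_zaldivar.okta_group_members_formatted
  GROUP BY as_of_date
  ORDER BY as_of_date DESC
""").show()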

Well done!

You've made it to the end of the next installment and now you've got your Users, Groups, Group Rules, and Group Members all imported to Databricks! Don't forget to create a Schedule so that these Notebooks will all run daily. This will give you a daily snapshot of your environment.
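If you'd rather define that schedule in code than in the Jobs UI, here's a rough sketch using the Databricks SDK for Python. The job name, notebook path, task key, and cron expression are all placeholders, and depending on your workspace you may also need to attach a cluster spec to the task:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Hypothetical daily 06:00 UTC run of this Notebook; adjust the path and cron to taste
w.jobs.create(
    name="okta-group-members-daily",
    tasks=[jobs.Task(
        task_key="export_group_members",
        notebook_task=jobs.NotebookTask(notebook_path="/Users/me/okta_group_members"),
    )],
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 6 * * ?", timezone_id="UTC"),
)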