
Now that we have our Users table all set up (read Part 1 of this series, if you need to catch up), we need to add a bit more data to the mix.

Groups

The next logical piece of data, in my mind, is our Okta groups. In Okta, we normally assign groups to applications to grant access; so, if you are a member of a specific group, we can say that you have access to the corresponding application. We will need both group memberships and application information to complete the full picture, but one step at a time. So, let's get some group data into Databricks!

Start with a new Notebook, specifically for collecting Group data. I will skip the key-retrieval code that was in the previous post, but feel free to review how to use secrets as often as needed.
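As a refresher, that lookup boils down to a single call. A minimal sketch, assuming a secret scope named okta and a key named api_token (substitute whatever you configured in Part 1):

# hypothetical scope/key names -- use the ones you set up in Part 1
okta_key = dbutils.secrets.get(scope="okta", key="api_token")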

For this Notebook, I'm going to import the Okta SDK, just to give it a try. I decided to be a little extra and install the modules conditionally, only when they're missing. Because I always get a message that the kernel may need to restart, I went ahead and handled that as well.

import importlib.util

mods = ['nest_asyncio', 'okta']
restart_required = False
for mod in mods:
  # only install a module if it isn't already available
  spec = importlib.util.find_spec(mod)
  if spec is not None:
    print(f'{mod} already installed')
  else:
    %pip install {mod}
    restart_required = True

# installing new packages requires a Python restart to pick them up
if restart_required:
  dbutils.library.restartPython()

Now that we've got our required modules installed, we can get to the meat and potatoes.

import nest_asyncio
import asyncio
from okta.client import Client as OktaClient


config = {
    'orgUrl': 'https://my-okta-org.okta.com', # replace with your org url
    'token': okta_key # the API token pulled from Databricks secrets (see Part 1)
}

okta_client = OktaClient(config)

async def list_okta_groups():
    group_list = [] # running collection
    groups, resp, err = await okta_client.list_groups() # get the first page of groups

    while True:
        for group in groups:
            group_list.append(group) # add each group to the collection
        if resp.has_next(): # if there are more groups to query, paginate to complete the collection
            groups, err = await resp.next()
        else:
            break
    return group_list # return the collection

if __name__ == '__main__':
    nest_asyncio.apply() # allow asyncio.run() inside the notebook's already-running event loop
    groups = asyncio.run(list_okta_groups())
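Before moving on, an optional sanity check confirms the pagination ran to completion. The attribute names here follow the Okta Python SDK's Group model:

# optional sanity check -- confirm we actually collected groups
print(f'Collected {len(groups)} groups')
if groups:
    print(groups[0].profile.name) # peek at the first group's name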

Great! Now, we've got all of the groups, as a list of group objects, in a variable called groups. We want to continue our practice of adding today's date to each record, so we can always look back to see what our environment looked like on any given day.

from datetime import date

new_coll = []
today = date.today()

for one in groups:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.__dict__.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # normalize the "type" value to a plain string. Your mileage may vary, but my
    # Notebook needed this to process the collection properly -- likely an SDK quirk,
    # so we'll use it sparingly going forward.
    updated_one['type'] = one.type.split(":")[0]
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
groups = new_coll

Next, let's define the schema for our dataframe and load the collection into a dataframe using that schema!

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# getOrCreate() reuses the SparkSession that Databricks notebooks already provide
spark = SparkSession.builder.appName("OktaGroups").getOrCreate()

# Define the schema
schema = StructType([
  StructField("id", StringType(), True),
  StructField("created", StringType(), True),
  StructField("last_updated", StringType(), True),
  StructField("last_membership_updated", StringType(), True),
  StructField("object_class", StringType(), True),
  StructField("type", StringType(), True),
  StructField("profile", StructType([
    StructField("name", StringType(), True),
    StructField("description", StringType(), True)
  ])),
  StructField("as_of_date", StringType(), True),
])

# use the base schema for the "bronze" layer dataframe definition
df = spark.createDataFrame(groups, schema)
# flatten the nested profile fields for the "silver" layer dataframe definition
df_formatted = df.select("id", "created", "last_updated", "last_membership_updated", "object_class", "type", "profile.name", "profile.description", "as_of_date")
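If you want to eyeball the silver layer before writing it out, a quick preview helps (entirely optional):

# optional: preview the flattened silver layer
df_formatted.show(5, truncate=False)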

Once again, I have my raw data (bronze layer) defined as the variable df, and a flattened version of the same data (silver layer) as df_formatted.

Now, let's save the data to a table using the same logic we did for our Users table.

df.write.option("mergeSchema", "true").mode("append").saveAsTable("users.jack_zaldivar.okta_groups")
df_formatted.write.option("mergeSchema", "true").mode("append").saveAsTable("users.jack_zaldivar.okta_groups_formatted")
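To verify the write landed, a quick aggregate against the new table does the trick. A minimal sketch using the table name from above:

# confirm today's snapshot landed in the silver table
display(spark.sql("""
  SELECT as_of_date, COUNT(*) AS group_count
  FROM users.jack_zaldivar.okta_groups_formatted
  GROUP BY as_of_date
  ORDER BY as_of_date DESC
"""))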

That's it! You now have a table with all of your group data! Let's move on to Group Rules.

Group Rules

The setup for Group Rules export is basically the same, so we won't rehash that. We can just jump directly into the good stuff. 

Note: I do have this as a separate Notebook so that I can run it as an independent job.

import nest_asyncio
import asyncio
from okta.client import Client as OktaClient


config = {
    'orgUrl': 'https://my-okta-org.okta.com', # replace with your actual org url
    'token': okta_key # the API token pulled from Databricks secrets (see Part 1)
}

okta_client = OktaClient(config)

async def list_okta_group_rules():
    rule_list = []
    rules, resp, err = await okta_client.list_group_rules()
    print(err) # surface any API error before paginating
    while True:
        for rule in rules:
            rule_list.append(rule)
        if resp.has_next():
            rules, err = await resp.next()
        else:
            break
    return rule_list

if __name__ == '__main__':
    nest_asyncio.apply()
    rules = asyncio.run(list_okta_group_rules())

Once again, I'm taking advantage of the SDK, against my better judgement. However, it gets the job done, so I haven't changed it...for now. Don't forget to add our date!

from datetime import date

new_coll = []
today = date.today()

for one in rules:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.__dict__.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
rules = new_coll

As always, define your schema for the collection:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# create a SparkSession (reuses the notebook's existing session)
spark = SparkSession.builder.appName("OktaRules").getOrCreate()

# Define the schema
schema = StructType([
  StructField("actions", StructType([
    StructField("assign_user_to_groups", StructType([
      StructField("group_ids", ArrayType(StringType(), True), True)
    ]))
  ])),
  StructField("conditions", StructType([
    StructField("expression", StructType([
      StructField("type", StringType(), True),
      StructField("value", StringType(), True)
    ]))
    #StructField("people", ArrayType(StringType())))
  ])),
  StructField("created", StringType(), True),
  StructField("id", StringType(), True),
  StructField("last_updated", StringType(), True),
  StructField("name", StringType(), True),
  StructField("status", StringType(), True),
  StructField("type", StringType(), True),
  StructField("as_of_date", StringType(), True)
])

df = spark.createDataFrame(rules, schema)
# explode the list of target group ids so each row pairs one rule with one group
df_formatted = df.select(explode("actions.assign_user_to_groups.group_ids").alias("group_id"), col("conditions.expression.type").alias("expressionType"), "conditions.expression.value", "created", "id", "last_updated", "name", "status", "type", "as_of_date")

Great! We've got our Group Rules schema defined, plus bronze and silver layer dataframes! Let's dump these to their respective tables and call it a day!

df.write.option("mergeSchema", "true").mode("append").saveAsTable("users.jack_zaldivar.okta_group_rules") 
df_formatted.write.option("mergeSchema", "true").mode("append").saveAsTable("users.jack_zaldivar.okta_group_rules_formatted") 
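Because each rule row now pairs a rule with a single group_id, you can already join the rules back to the groups table to resolve group names. A minimal sketch, assuming today's snapshots of the two silver tables defined in this post:

# illustrative join: resolve each rule's target group ids to group names
display(spark.sql("""
  SELECT r.name AS rule_name, r.status, g.name AS group_name
  FROM users.jack_zaldivar.okta_group_rules_formatted r
  JOIN users.jack_zaldivar.okta_groups_formatted g
    ON r.group_id = g.id
  WHERE r.as_of_date = CAST(current_date() AS STRING)
    AND g.as_of_date = CAST(current_date() AS STRING)
"""))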

Perfect! We're building our data sets! So far, we've got Users, Groups, and Group Rules all exported from Okta to Databricks! Pat yourself on the back! You're doing great!

Back to Part 1