jack_zaldivar
Databricks Employee

Howdy!

Welcome to my first blog!

As an Identity System Administrator, I would say a good portion of my day was spent configuring Okta and other identity systems. As I've moved into more of an Identity Engineering role, analytics has become more important to the decision-making process. My identity system is Okta, and Okta doesn't make it easy to dig into some of the more interesting datapoints, like:

  • How are users assigned to each application?
  • Which users are accessing applications?
  • Which applications might be over-provisioned?
  • How are users interacting with Okta when logging into their systems?

My goal for this blog is to walk you, The Reader, through importing your Okta configuration data into a Databricks workspace and analyzing that data via Queries, Dashboards, and AI/BI conversations.

Requirements

In order to successfully follow along with this blog series, you should have a few things available to you.

  • A Databricks account/workspace
  • An Okta account with Administrative permissions to generate an API token.
    • While it is possible to use OAuth for authentication, that will be covered in a later post; for now, we'll start with an API token.
    • If you do not have an Okta account to work with, or you don't want to use your production account while working through these blogs, you can always sign up for a free developer instance: https://developer.okta.com/signup
  • While this series is geared towards ingesting Okta system configuration data into Databricks, the same method can be used for ANY system that exposes API endpoints!

For the sake of brevity, I will assume that you already have a Databricks and Okta account at your disposal. If that's the case, then let's jump right in!

***Disclaimer: I am not a Data Engineer or Data Scientist. I am an Okta Administrator. That being said, much of what is discussed in this blog can likely be done better than I have done it. I have learned this platform on my own and fumbled my way through the process as needed. PLEASE feel free to make this series better with your insights!

Secret Management

If you're working with APIs, you likely have a token that you need in order to call the API endpoints you want to ingest into Databricks. You don't want to store these tokens in plain text within a Notebook, so let's store them securely with Databricks Secrets.

  1. Install the Databricks CLI. I won't go into detail here; there are plenty of walkthroughs and discussions on installing the CLI, and I suggest you follow one of those guides.
  2. Configure the CLI for authentication. Again, because other resources already detail how to accomplish this, I will defer to those guides.
  3. Create a Secret Scope. A secret scope is a container that groups related secrets, so you can share access to everything in the scope with other users at once. Feel free to follow the guide on creating a secret scope.
  4. Save your secret. The previous article discusses how to save and reference your secret in a Notebook, so I won't go into detail on that either; I'm only adding it as a step here for completeness. A quick sanity check you can run from a Notebook follows this list.
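As a minimal sanity check (a sketch; shared_scope and okta_secret are the scope and key names assumed throughout this post), you can confirm the scope and secret are visible from any Notebook cell:

# List the secret scopes visible to you, then the keys inside the scope.
# 'shared_scope' and 'okta_secret' are the names used later in this walkthrough.
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list('shared_scope'))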

Your First Notebook!

Now that we've done all of the prep work, we finally get to the exciting part: your first Notebook! We're going to start simple and import our Okta users into Databricks. As we progress through this series, you will notice a pattern for data ingestion from an API endpoint:

  1. Save your token
  2. Build a Notebook that uses the token to ingest a single datapoint, paginating if necessary
  3. Save the JSON object as a table

Let's assume your secret scope is called shared_scope and your API token is saved as okta_secret. Create a Notebook cell with the following Python code:

import os.path
import json
import math
from datetime import date, datetime

from pyspark.sql.functions import *

#get today's date for historical records
today = date.today()

# get the secret that was created via CLI (scope="shared_scope", key="okta_secret")
okta_secret = dbutils.secrets.getBytes('shared_scope', 'okta_secret')

Now that we have the secret in the Notebook, we need to decode it from bytes to a string. I usually make a new cell for the decoding step.

try:
  okta_key = okta_secret.decode()
except Exception as e:
  print("Could not decode the secret value")
  print("Error:", e)

Next, we'll paginate through the Okta users list. I know there is an SDK for most Okta operations, but I don't think everything is covered by it, and I'm honestly not sure how it handles rate limiting, so I usually build my own calls instead of relying on the SDK. This is likely one of those steps that can be improved, but it works for me, so I've kept it as is.

import requests

# Okta org configuration
config = {
    'orgUrl': 'https://my-okta-org.okta.com',
}
okta_header = {"Authorization": 'SSWS ' + okta_key}

# Paginate through the Users API to collect all users.
# requests parses the Link response header into ret.links, so we can follow
# the "next" relation until Okta stops returning one.
all_users = []
next_url = config['orgUrl'] + '/api/v1/users?search=status pr'
while next_url:
  ret = requests.get(next_url, headers=okta_header)
  all_users += ret.json()
  print('Got ' + str(len(all_users)) + ' users')
  next_url = ret.links.get('next', {}).get('url')
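Since rate limiting came up above, here is an optional helper. It's a minimal sketch rather than anything from the original walkthrough; it leans on Okta's documented x-rate-limit-remaining and x-rate-limit-reset response headers, and you could swap it in for the plain requests.get() calls in the pagination cell if you start bumping into limits.

import time
import requests

def okta_get(url, headers):
  # Make the request, then check how much rate-limit budget remains.
  resp = requests.get(url, headers=headers)
  remaining = int(resp.headers.get('x-rate-limit-remaining', '1'))
  reset_at = int(resp.headers.get('x-rate-limit-reset', '0'))
  if remaining <= 1 and reset_at:
    # x-rate-limit-reset is an epoch timestamp; sleep until then, plus a small buffer.
    wait = max(reset_at - time.time(), 0) + 1
    print('Rate limit nearly exhausted; sleeping ' + str(round(wait)) + ' seconds')
    time.sleep(wait)
  return resp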

At the end of the pagination cell, we have a collection of all users, with their default and custom attributes, as a JSON object. Next, I like to be a little extra and add a collection date to the data. This way, I can see how the data shifts between specific dates or over time.

new_coll = []
for one in all_users:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
all_users = new_coll

Now that we have all of our data, we need to define the schema for our dataframe. If your user profiles contain custom attributes, you need to define them here!

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, ArrayType

# Create a SparkSession
spark = SparkSession.builder.appName("OktaUser").getOrCreate()

# Define the profile attributes.
profile_attrs = StructType([
      StructField("lastName", StringType(), True),
      StructField("hireDate", StringType(), True),
      StructField("timezone", StringType(), True),
      StructField("manager", StringType(), True),
      StructField("city", StringType(), True),
      StructField("accountType", StringType(), True),
      StructField("secondEmail", StringType(), True),
      StructField("title", StringType(), True),
      StructField("login", StringType(), True),
      StructField("firstName", StringType(), True),
      StructField("costCenter", StringType(), True),
      StructField("streetAddress", StringType(), True),
      StructField("countryCode", StringType(), True),
      StructField("state", StringType(), True),
      StructField("department", StringType(), True),
      StructField("email", StringType(), True)
])
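
# Hypothetical example (not part of the original walkthrough): if your org defines
# custom Okta profile attributes, add one StructField per attribute to the list above,
# e.g. StructField("employeeNumber", StringType(), True)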

# Define the default schema. These attributes are universal in Okta, except as_of_date, which we added earlier
schema = StructType([
    StructField("id", StringType(), True),
    StructField("status", StringType(), True),
    StructField("created", StringType(), True),
    StructField("activated", StringType(), True),
    StructField("statusChanged", StringType(), True),
    StructField("lastLogin", StringType(), True),
    StructField("lastUpdated", StringType(), True),
    StructField("passwordChanged", StringType(), True),
    StructField("type", StructType([
      StructField("id", StringType(), True)
    ]), True),
    StructField("profile", profile_attrs, True), # here, we use the profile schema defined above
    StructField("as_of_date", StringType(), False) # Make sure to define your as_of_date so you can capture the "date of data"
])

#use the schema to cast the JSON object to a dataframe
df = spark.createDataFrame(all_users, schema) 

# Display the DataFrame (Optional)
# df.show(1, False)

# This step explodes the raw data into more columns so that it's easier to read. Notice the profile.* selection. That statement will take every attribute nested within the "profile" object and "explode" it into its own column!
data_expanded=df.select("id", "status", "created", "activated", "statusChanged", "lastLogin", "lastUpdated", "passwordChanged", "type", "profile.*", "as_of_date")
data_expanded.show(1, True) # this will display 1 row of the dataframe on the screen

Now that we have our dataframe, we need to write it to a table. We want to make sure we append the data, and NOT overwrite. This will ensure that we have yesterday's data as well as all future data for pattern analysis and other neat options later! For this example, I am saving to my personal schema. However, in a production environment, you would want to save this to a shared schema, protected by Unity Catalog permissions.

I also save both the raw data and the "exploded" data, just in case we need to do any future transforms on the raw data at a later time.

# Save raw data to table. I believe this is referred to as the "bronze" layer
df.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_users_bronze", mode="append")

# Save "exploded" data to another table. I believe this is referred to as the "silver" layer
data_expanded.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_users_silver", mode="append")
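If you want to double-check the write, a quick query against the silver table (using the same table name as above) shows how many users were captured for each collection date:

spark.sql("""
  SELECT as_of_date, COUNT(*) AS user_count
  FROM users.jack_zaldivar.okta_users_silver
  GROUP BY as_of_date
  ORDER BY as_of_date
""").show()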

That's it! Assuming everything worked correctly, you should now have a table in your Databricks instance with all of your user data.

Schedule it!

Recall that we set an as_of_date attribute earlier. We did this because we plan to run this Notebook on a schedule and continually append each day's data to the tables. That gives us a running list of all users in our environment and their status/attributes at specific points in time. Now that we have a working Notebook, let's set a schedule for it to run daily: in the Notebook toolbar, click Schedule, give the job a name, and choose a daily cadence.
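If you'd rather define the schedule in code, here is a minimal sketch using the Databricks Python SDK. The job name, Notebook path, cron expression, and timezone are placeholders, and depending on your workspace you may need to attach compute (for example, an existing cluster ID or a job cluster) to the task.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="okta-users-daily-ingest",  # placeholder job name
    tasks=[
        jobs.Task(
            task_key="ingest_okta_users",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Users/you@example.com/okta_users_ingest"  # placeholder path
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # daily at 06:00
        timezone_id="America/Chicago",
    ),
)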

 

Take a break!

That's it for part 1 of the series! You did great! If you followed along, we were able to:

  1. Create and manage a secure Databricks Secret Scope via the CLI
  2. Interact with a third-party API endpoint via Databricks Notebooks
  3. Save third-party system data to Databricks tables

For the rest of the series, I will discuss importing more datapoints from Okta to Databricks, show how to make meaningful Dashboards for your stakeholders, and even interact with the AI/BI Genie so your less technical users can discover more about your data!

Next: Part 2