Howdy!
Welcome to my first blog!
As an Identity System Administrator, I spent a good portion of my day configuring Okta and other identity systems. As I have moved into more of an Identity Engineering role, analytics has become more important to the decision-making process. My identity system is Okta, and Okta doesn't make some of the more interesting datapoints easy to get at, like how your user population, statuses, and attributes shift over time.
My goal for this blog is to walk you, The Reader, through importing your Okta configuration data into a Databricks workspace and analyzing that data via Queries, Dashboards, and AI/BI conversations.
Requirements
In order to successfully follow along with this blog series, you should have a couple of things available to you.
For the sake of brevity, I will assume that you already have a Databricks and Okta account at your disposal. If that's the case, then let's jump right in!
***Disclaimer: I am not a Data Engineer or Data Scientist. I am an Okta Administrator. That being said, many of the things I do in this blog can likely be done much better. I have learned this platform on my own and fumbled my way through the process as needed. PLEASE feel free to make this series better with your insights!
Secret Management
If you're working with APIs, you likely have a token that you need to use to hit the API endpoints you want to ingest into Databricks. You don't want to store these tokens in plain text within a Notebook. So, let's store them securely within Databricks Secrets.
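The Databricks CLI is one way to create the secret, and it's what I used (you'll see the scope and key names again in the Notebook). If you'd rather stay in Python, here is a minimal sketch using the databricks-sdk package that does the same thing; the scope name shared_scope and key okta-key match what we use later, and the token value is a placeholder.
# Sketch: store the Okta API token as a Databricks secret via the Python SDK.
# The Databricks CLI ("databricks secrets create-scope" / "databricks secrets put-secret") works just as well.
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()  # picks up workspace auth from your environment or config profile
# create the scope once; this call fails if the scope already exists
w.secrets.create_scope(scope="shared_scope")
# store the Okta API token under the key we'll reference from the Notebook
w.secrets.put_secret(scope="shared_scope", key="okta-key", string_value="<your-okta-api-token>")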
Your First Notebook!
Now that we've done all of the prep work, we finally get to the exciting part; your first Notebook! We're going to start very simply, and import our Okta users to Databricks. As we progress through this series, you will notice a pattern with data ingestion from an API endpoint.
Let's assume your secret scope is called shared_scope and your API token is saved under the key okta-key. Create a Notebook cell with the following Python code:
import os.path
import json
import math
from datetime import date, datetime
from pyspark.sql.functions import *
#get today's date for historical records
today = date.today()
# get the secret that was created via CLI (scope="shared_scope", key="okta-key")
okta_secret = dbutils.secrets.getBytes('shared_scope', 'okta-key')
Now that we have the secret in the Notebook, we need to decode it from bytes into a string. I usually make a new cell for the decoding step.
try:
    okta_key = okta_secret.decode()
except Exception as e:
    print("Could not decode secret:", okta_secret)
    print("Error:", e)
Next, we'll paginate through the Okta users list. I know Okta publishes an SDK that covers most operations, but I don't think everything is covered, and I'm honestly not sure how it handles rate limiting, so I usually build my own calls instead of relying on the SDK. This is likely one of those steps that can be improved, but it works for me, so I've kept it as is.
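Since I brought up rate limiting: Okta returns x-rate-limit-remaining and x-rate-limit-reset headers on every API response, so if you ever start bumping into limits, a small wrapper like the sketch below can pause until the limit window resets. The helper name and threshold are my own invention, and I don't use it in the cells that follow, but it drops in easily in place of the raw requests.get calls.
import time
import requests

def get_with_rate_limit(url, headers, floor=5):
    # Hypothetical helper: GET a URL, and if the Okta rate limit is nearly exhausted,
    # sleep until the reset time (epoch seconds) before returning the response
    resp = requests.get(url, headers=headers)
    remaining = int(resp.headers.get('x-rate-limit-remaining', floor + 1))
    if remaining <= floor:
        reset_at = int(resp.headers.get('x-rate-limit-reset', 0))
        wait = max(reset_at - time.time(), 0) + 1
        print('Rate limit nearly exhausted, sleeping ' + str(round(wait)) + 's')
        time.sleep(wait)
    return resp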
import requests

# Okta org configuration
config = {
    'orgUrl': 'https://my-okta-org.okta.com',
}
okta_header = {"Authorization": 'SSWS ' + okta_key}

# paginate through to collect all users, starting with the first page
# ('search=status pr' matches every user with a status set, i.e., all users)
ret = requests.get(config['orgUrl'] + '/api/v1/users?search=status pr', headers=okta_header)
all_users = ret.json()
print('Got ' + str(len(all_users)) + ' users')

# the "link" response header holds the self and next page URLs; pull out the next one (if any)
at_end = ',' not in ret.headers['link']
if not at_end:
    nextUrl = ret.headers['link'].split(',')
    nextUrl = nextUrl[1].split(';')
    nextUrl = nextUrl[0].replace('>', '').replace('<', '')

while at_end == False:
    ret = requests.get(nextUrl, headers=okta_header)
    all_users += ret.json()
    if ',' in ret.headers['link']:
        # there is still a "next" link, so keep paginating
        nextUrl = ret.headers['link'].split(',')
        if ';' in nextUrl[1]:
            nextUrl = nextUrl[1].split(';')
            nextUrl = nextUrl[0].replace('>', '').replace('<', '')
            print('Got ' + str(len(all_users)) + ' users')
        else:
            break
    else:
        # only a "self" link remains, so we've reached the last page
        at_end = True
At the end of this cell, we now have a collection of all users, with their default and custom attributes, as a list of JSON objects. Next, I like to be a little extra and add a collection date to the data. This way, I can see the shift in data between specific dates or over time.
new_coll = []
for one in all_users:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)
all_users = new_coll
Now that we have all of our data, we need to define the schema for our dataframe. If your user profiles contain custom attributes, you need to define them here!
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, ArrayType
# Create a SparkSession
spark = SparkSession.builder.appName("OktaUser").getOrCreate()
# Define the profile attributes.
profile_attrs = StructType([
    StructField("lastName", StringType(), True),
    StructField("hireDate", StringType(), True),
    StructField("timezone", StringType(), True),
    StructField("manager", StringType(), True),
    StructField("city", StringType(), True),
    StructField("accountType", StringType(), True),
    StructField("secondEmail", StringType(), True),
    StructField("title", StringType(), True),
    StructField("login", StringType(), True),
    StructField("firstName", StringType(), True),
    StructField("costCenter", StringType(), True),
    StructField("streetAddress", StringType(), True),
    StructField("countryCode", StringType(), True),
    StructField("state", StringType(), True),
    StructField("department", StringType(), True),
    StructField("email", StringType(), True)
])
# Define the default schema. These attributes are universal in Okta, except as_of_date, which we added earlier
schema = StructType([
    StructField("id", StringType(), True),
    StructField("status", StringType(), True),
    StructField("created", StringType(), True),
    StructField("activated", StringType(), True),
    StructField("statusChanged", StringType(), True),
    StructField("lastLogin", StringType(), True),
    StructField("lastUpdated", StringType(), True),
    StructField("passwordChanged", StringType(), True),
    StructField("type", StructType([
        StructField("id", StringType(), True)
    ]), True),
    StructField("profile", profile_attrs, True),  # here, we use the profile schema defined above
    StructField("as_of_date", StringType(), False)  # Make sure to define your as_of_date so you can capture the "date of data"
])
#use the schema to cast the JSON object to a dataframe
df = spark.createDataFrame(all_users, schema)
# Display the DataFrame (Optional)
# df.show(1, False)
# This step explodes the raw data into more columns so that it's easier to read. Notice the profile.* selection. That statement will take every attribute nested within the "profile" object and "explode" it into its own column!
data_expanded=df.select("id", "status", "created", "activated", "statusChanged", "lastLogin", "lastUpdated", "passwordChanged", "type", "profile.*", "as_of_date")
data_expanded.show(1, True) # this will display 1 row of the dataframe on the screen
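One note on custom attributes, since I mentioned them before the schema: each custom attribute is just another StructField inside profile_attrs, added before you call createDataFrame. For example, a hypothetical custom attribute named employeeNumber (use whatever your org actually defines) would be added like this:
# Hypothetical example: extend profile_attrs with a custom attribute before building the DataFrame
profile_attrs = StructType(profile_attrs.fields + [
    StructField("employeeNumber", StringType(), True)
])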
Now that we have our dataframe, we need to write it to a table. We want to make sure we append the data, and NOT overwrite. This will ensure that we have yesterday's data as well as all future data for pattern analysis and other neat options later! For this example, I am saving to my personal schema. However, in a production environment, you would want to save this to a shared schema, protected by Unity Catalog permissions.
I also save both the raw data and the "exploded" data, just in case we need to do any future transforms on the raw data at a later time.
# Save raw data to a table. I believe this is referred to as the "bronze" layer
df.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_users_bronze", mode="append")
# Save the "exploded" data to another table. I believe this is referred to as the "silver" layer
data_expanded.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_users_silver", mode="append")
That's it! Assuming everything worked correctly, you should now have tables in your Databricks workspace with all of your user data.
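If you want to sanity-check the append-only pattern (or just poke at the data), a query along these lines works. The table name matches the silver table above, and the columns come from the schema we defined:
# Count users by status for each snapshot date; over time this shows how the population shifts
snapshot_counts = spark.sql("""
    SELECT as_of_date, status, COUNT(*) AS user_count
    FROM users.jack_zaldivar.okta_users_silver
    GROUP BY as_of_date, status
    ORDER BY as_of_date, status
""")
snapshot_counts.show()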
Schedule it!
Recall that we set an as_of_date attribute earlier. We did this because we plan to run this Notebook on a schedule and continually append the day's data to it. This will give us a running list of all users in our environment and their status/attributes at that specific point in time. Now that we have a working Notebook, let's set a schedule for the Notebook to run daily.
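The quickest way is the Schedule button in the Notebook UI, which creates a Job for you with a daily trigger. If you'd rather do it in code, here is a rough sketch using the databricks-sdk Jobs API; the job name, notebook path, and cron expression are all placeholders, and depending on your workspace you may also need to attach compute to the task.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Sketch: create a Job that runs the Notebook every day at 06:00
w.jobs.create(
    name="Okta users daily import",  # placeholder job name
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # daily at 06:00
        timezone_id="America/Chicago",
    ),
    tasks=[
        jobs.Task(
            task_key="import_okta_users",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/you@example.com/okta_users"  # placeholder path to this Notebook
            ),
        )
    ],
)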
Take a break!
That's it for part 1 of the series! You did great! If you followed along, we were able to store an Okta API token securely in Databricks Secrets, paginate through the Okta users API, stamp each record with an as_of_date, write the results to bronze and silver tables, and schedule the Notebook to run daily.
For the rest of the series, I will discuss importing more datapoints from Okta to Databricks, show how to make meaningful Dashboards for your stakeholders, and even interact with the AI/BI Genie so your less technical users can discover more about your data!