Now that we have our Users table all set up (read Part 1 of this series, if you need to catch up), we need to add a bit more data to the mix.
Groups
The next logical piece of data, in my mind, is to grab our Okta groups. We normally assign groups to applications for access in Okta. Therefore, if you are a member of a specific group, we can say that you have access to the respective application. We will need group memberships and application information to make the full picture. However, we need to take one step at a time. So, let's get some group data into Databricks!
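To make that end state concrete, here's a toy sketch in plain Python (hypothetical users, groups, and app assignments, not the Okta API) of how group membership will eventually translate into application access once we have all three data sets:

```python
# Toy illustration only: hypothetical data. The real mappings will come from
# the Okta exports we build over this series.
user_groups = {
    "alice": ["Engineering", "VPN-Users"],
    "bob": ["Sales"],
}
group_apps = {
    "Engineering": ["GitHub", "Jira"],
    "VPN-Users": ["VPN"],
    "Sales": ["Salesforce"],
}

def apps_for_user(user):
    """Derive app access from group membership."""
    apps = set()
    for group in user_groups.get(user, []):
        apps.update(group_apps.get(group, []))
    return sorted(apps)

print(apps_for_user("alice"))  # ['GitHub', 'Jira', 'VPN']
```

That join, of user-to-group and group-to-app, is the full picture we're working toward.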
Start with a new Notebook, specifically for collecting Group data. I will skip the key retrieval pieces of code that were in the previous post, but feel free to review how to utilize secrets as often as needed.
For this Notebook, I'm going to import the Okta SDK, just to give it a try. I decided to be a little extra and install the modules programmatically. Because the install always produces a message that the kernel may need to restart, I went ahead and threw that in as well.
import importlib.util
import sys

mods = ['nest_asyncio', 'okta']
restart_required = False

for mod in mods:
    spec = importlib.util.find_spec(mod)
    if spec is not None:
        print(f'{mod} already installed')
    else:
        %pip install {mod}
        restart_required = True

if restart_required:
    dbutils.library.restartPython()
Now that we've got our required modules installed, we can begin with the meat and potatoes.
#%pip install okta
import okta
import nest_asyncio
import asyncio
from okta.client import Client as OktaClient

config = {
    'orgUrl': 'https://my-okta-org.okta.com',  # replace with your org url
    'token': okta_key
}

okta_client = OktaClient(config)

async def list_okta_groups():
    group_list = []  # running collection
    groups, resp, err = await okta_client.list_groups()  # get the first 200 groups
    while True:
        for group in groups:
            group_list.append(group)  # add each group to the collection
        if resp.has_next():  # if there are more groups to query from the system, paginate to complete your collection
            groups, err = await resp.next()
        else:
            break
    return group_list  # return the collection

if __name__ == '__main__':
    nest_asyncio.apply()
    groups = asyncio.run(list_okta_groups())
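The `while`/`has_next()` loop above is the SDK's standard pagination pattern. Stripped of the Okta specifics (and of `await`, since the real SDK methods are async), the control flow reduces to this purely illustrative mock:

```python
# Mock paginator, illustrating the has_next()/next() control flow.
# The real Okta SDK response works the same way, but asynchronously.
class MockResponse:
    def __init__(self, pages):
        self._pages = pages
        self._idx = 0

    def has_next(self):
        # More pages remain after the current one?
        return self._idx < len(self._pages) - 1

    def next(self):
        # Advance and return (items, err), mirroring the SDK's shape.
        self._idx += 1
        return self._pages[self._idx], None

pages = [[1, 2], [3, 4], [5]]
resp = MockResponse(pages)
items = pages[0]  # the first page comes back with the initial call
collected = []
while True:
    collected.extend(items)
    if resp.has_next():
        items, err = resp.next()
    else:
        break

print(collected)  # [1, 2, 3, 4, 5]
```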
Great! Now, we've got all of the groups, as a JSON object, in a variable called groups. We want to continue our practice of adding today's date to the collection, so we can always look back to see what our environment looked like on any given day.
from datetime import date

new_coll = []
today = date.today()

for one in groups:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.__dict__.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Trim the pickled suffix from the "type" attribute. I added this recently to
    # help with the "unpickling" process. Your mileage may vary, but my Notebook
    # needed it to process the collection properly. I suspect this is something
    # specific to the SDK, so we'll use it sparingly going forward.
    updated_one['type'] = one.type.split(":")[0]
    # Append the updated dictionary to the list
    new_coll.append(updated_one)

groups = new_coll
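Once every row carries an `as_of_date`, reconstructing what the environment looked like on a given day is just a filter. A plain-Python sketch with made-up snapshot records (in practice these rows live in the Databricks table):

```python
# Made-up snapshot rows, standing in for the Databricks table.
snapshots = [
    {"name": "Engineering", "as_of_date": "2024-01-01"},
    {"name": "Engineering", "as_of_date": "2024-01-02"},
    {"name": "Sales", "as_of_date": "2024-01-02"},
]

def groups_on(day):
    """Return the group names as they existed on the given day."""
    return [row["name"] for row in snapshots if row["as_of_date"] == day]

print(groups_on("2024-01-02"))  # ['Engineering', 'Sales']
```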
Next, let's define the schema for our dataframe and cast the JSON into a dataframe using the schema!
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, ArrayType

# Create a SparkSession
spark = SparkSession.builder.appName("OktaGroups").getOrCreate()

# Define the schema
schema = StructType([
    StructField("id", StringType(), True),
    StructField("created", StringType(), True),
    StructField("last_updated", StringType(), True),
    StructField("last_membership_updated", StringType(), True),
    StructField("object_class", StringType(), True),
    StructField("type", StringType(), True),
    StructField("profile", StructType([
        StructField("name", StringType(), True),
        StructField("description", StringType(), True)
    ])),
    StructField("as_of_date", StringType(), True),
])

# Use base schema for "bronze" layer dataframe definition
df = spark.createDataFrame(groups, schema)

# Explode bronze layer for silver layer dataframe definition
df_formatted = df.select("id", "created", "last_updated", "last_membership_updated", "object_class", "type", "profile.name", "profile.description", "as_of_date")
Once again, I have my raw data (bronze layer) defined as the variable df, and an "exploded" version of the same data (silver layer) as df_formatted.
Now, let's save the data to a table using the same logic we did for our Users table.
df.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_groups", mode="append")
df_formatted.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_groups_formatted", mode="append")
That's it! You now have a table with all of your group data! Let's move on to Group Rules.
Group Rules
The setup for Group Rules export is basically the same, so we won't rehash that. We can just jump directly into the good stuff.
***Note: I do have this as a separate Notebook so that I can run it as an independent job.
#%pip install okta
import okta
import nest_asyncio
import asyncio
from okta.client import Client as OktaClient

config = {
    'orgUrl': 'https://my-okta-org.okta.com',  # replace with your actual org url
    'token': okta_key
}

okta_client = OktaClient(config)

async def list_okta_group_rules():
    rule_list = []
    rules, resp, err = await okta_client.list_group_rules()
    print(err)
    while True:
        for rule in rules:
            rule_list.append(rule)
        if resp.has_next():
            rules, err = await resp.next()
        else:
            break
    return rule_list

if __name__ == '__main__':
    nest_asyncio.apply()
    rules = asyncio.run(list_okta_group_rules())
Once again, I'm taking advantage of the SDK, against my better judgement. However, it gets the job done, so I haven't changed it...for now. Don't forget to add our date!
from datetime import date

new_coll = []
today = date.today()

for one in rules:
    # Create a copy of the dictionary to avoid modifying the original
    updated_one = one.__dict__.copy()
    # Add the new key-value pair
    updated_one['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_one)

rules = new_coll
As always, define a schema for the JSON object before casting it into a dataframe.
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, ArrayType

# Create a SparkSession
spark = SparkSession.builder.appName("OktaRules").getOrCreate()

# Define the schema
schema = StructType([
    StructField("actions", StructType([
        StructField("assign_user_to_groups", StructType([
            StructField("group_ids", ArrayType(StringType(), True), True)
        ]))
    ])),
    StructField("conditions", StructType([
        StructField("expression", StructType([
            StructField("type", StringType(), True),
            StructField("value", StringType(), True)
        ]))
        #StructField("people", ArrayType(StringType()))
    ])),
    StructField("created", StringType(), True),
    StructField("id", StringType(), True),
    StructField("last_updated", StringType(), True),
    StructField("name", StringType(), True),
    StructField("status", StringType(), True),
    StructField("type", StringType(), True),
    StructField("as_of_date", StringType(), True)
])

df = spark.createDataFrame(rules, schema)
df_formatted = df.select(explode("actions.assign_user_to_groups.group_ids"), col("conditions.expression.type").alias("expressionType"), "conditions.expression.value", "created", "id", "last_updated", "name", "status", "type", "as_of_date")
Great! We've got our Group Rules schema defined and a bronze and silver layer dataframe defined! Let's dump these to their respective tables and call it a day!
df.write.option("mergeSchema", "true").mode("append").saveAsTable("users.jack_zaldivar.okta_group_rules")
df_formatted.write.option("mergeSchema", "true").mode("append").saveAsTable("users.jack_zaldivar.okta_group_rules_formatted")
Perfect! We're building our data sets! So far, we've got Users, Groups, and Group Rules all exported from Okta to Databricks! Pat yourself on the back! You're doing great!