cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
jack_zaldivar
Databricks Employee
Databricks Employee

In the previous section, we discovered how to pull Applications and Application Groups into Databricks. I briefly mentioned that we generally assign users to groups and then assign groups to applications for access in our system. This practice makes it a bit easier to manage access, especially in a complicated and fast growing environment. Later, we can also use these groups in things like Group Rules to assign applications as birthrights, based on user attributes in our directory.

Though Group Membership is our preferred method of granting access to applications, it is not the ONLY method that Okta provides. Accounts can also be individually assigned to applications, without being in a group. To discover these types of assignments, we'll pull all of the individual users assigned to each app and store that information for analysis and alerting, as necessary. This also helps us to know HOW each user is getting access to an application, in the case of multiple potential access groups being assigned to 1 user.

Let's Begin

As with each previous Notebook, we'll install our libraries for this Notebook and restart our kernel if any of the modules weren't originally present.

import importlib.util
import sys

mods = ['nest_asyncio', 'okta']
restart_required = False
for mod in mods:
  spec = importlib.util.find_spec(str(mod))
  if spec is not None:
    print(f'{mod} already installed')
  else:
    %pip install {mod}
    restart_required=True

if restart_required==True:
  dbutils.library.restartPython()

Don't Forget Your Keys!

Check Part 1 for tips on how to work with Databricks Secrets. Don't skip this step because you'll need your API key for all operations later. In my case, the decoded key is named okta_key.

"Play Some Jazz"

As I've previously noted, I am not a classically trained Data Engineer or Data Scientist by any means. However, I do understand some data structure and how things can fit together to tell a story. Because of this cursory knowledge, I feel like I can kind of experiment and make some things up a bit. With a bit of creativity and a lot of effort, it might just work out somehow. I call this "playing jazz". Not to downplay Jazz players in any way, as I'm also a musician and previously played in Jazz Band in High School. 🙂  

For this next section, I decided to use the SDK as well as a bit of creative collecting via a custom dictionary. So, instead of staying with the SDK sheet music, I decided to play a little Jazz and improv a bit. Let's define the necessary functions.

import okta
import nest_asyncio
import asyncio
from okta.client import Client as OktaClient


config = {
    'orgUrl': 'https://my-okta-org.okta.com',
    'token': okta_key
}

okta_client = OktaClient(config)

async def list_okta_apps():
    app_list = []
    #use the SDK to collect the applications list
    apps, resp, err = await okta_client.list_applications()
    print(f'Got {apps.count}')
    while True:
        for app in apps:
            app_list.append(app)
        if resp.has_next():
            apps, err = await resp.next()
        else:
            break
    return app_list

async def list_app_assignments(apps):
    i = 1
    for app in apps:
        app_users_list = []
        if app['status'] == "ACTIVE":
            query_params = {'limit':'500'}
            app_users, resp, err = await okta_client.list_application_users(app['id'], query_params)
            #because we want to combine the application dictionary with the app user dictionary, 
            #we create a new dictionary to combine the two dicts into one and add that to a list.
            if len(app_users) > 0:
                for app_u in app_users:
                    app_u = app_u.__dict__
                    user_obj = {}
                    user_obj['app_id'] = app['id'] #app ID
                    user_obj['app_label'] = app['label'] #app label/displayName
                    user_obj['id'] = app_u['id'] #app user ID
                    user_obj['app_user_scope'] = app_u['scope'] #how are they assigned to the app
                    if app_u['scope'] == 'GROUP':
                        #if assignment by group, what group?
                        user_obj['group_name'] = app_u['links']['group']['name']
                    else:
                        user_obj['group_name'] = 'N/A'
                    app_users_list.append(user_obj)
                    user_obj = None #reset the custom dict
                while resp.has_next(): #paginate as needed
                    app_users, err = await resp.next()

                    if len(app_users) > 0:
                        for app_u in app_users:
                            app_u = app_u.__dict__
                            user_obj = {}
                            user_obj['app_id'] = app['id']
                            user_obj['app_label'] = app['label']
                            user_obj['id'] = app_u['id']
                            user_obj['app_user_scope'] = app_u['scope']
                            if app_u['scope'] == 'GROUP':
                                user_obj['group_name'] = app_u['links']['group']['name']
                            else:
                                user_obj['group_name'] = 'N/A'
                            app_users_list.append(user_obj)
        print(f'Collected: {len(app_users_list)} users') #print to the screen so we see that something is happening

        i=i+1 #counter for display

    return app_users_list

The Main Event

Here, we'll call our previously defined functions and remove any INACTIVE apps from the results before we start querying the users assigned. Next, we add our as_of_date to the ACTIVE apps in our collection.

nest_asyncio.apply()
apps = []
apps = asyncio.run(list_okta_apps()) or []

#remove inactive apps
new_apps = []
for row in apps:
    if row.status == 'INACTIVE':
        print(f'Skipping {row.label} from list of apps')
    else:
        new_apps.append(row)

#add as_of_date
new_coll = []
for app in new_apps:
    updated_row = app.__dict__.copy()
    updated_row['as_of_date'] = str(today)
    new_coll.append(updated_row)

new_apps = new_coll

Get Some Coffee

Depending on your environment, this next step can take a while. In my environment, it takes roughly 3 hours to complete, due to the the number of users and apps configured. Because this is potentially going to take a while, I recommend allowing this step to process and going to do other things. Maybe revisit one of the previous posts in this series and give me some pointers on how I can improve! 😄 

app_group_users = [] #collection of all apps and app users
count = 1
for item in new_apps:
    print(f'Processing app {count} of {len(new_apps)}: {item["label"]}')
    current_users = asyncio.run(list_app_assignments([item]))
    app_group_users += current_users
    count = count + 1

#add datestamp
new_coll = []
for row in app_group_users:
    # Create a copy of the dictionary to avoid modifying the original
    updated_row = row.copy()
    # Add the new key-value pair
    updated_row['as_of_date'] = str(today)
    # Append the updated dictionary to the list
    new_coll.append(updated_row)
app_groups = new_coll

Define the Schema and Write to Tables

Convert the list of dictionaries to a Dataframe and write to our bronze and silver layers. For silver, I also changed id to user_id, because I thought that id just seemed too vague. I feel like this will be helpful later on when we're writing some SQL and making some dashboard.

import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, ArrayType

# Create a SparkSession
spark = SparkSession.builder.appName("OktaAppUsers").getOrCreate()

# Define the schema
schema = StructType([
  StructField("app_id", StringType(), True),
  StructField("app_label", StringType(), True),
  StructField("app_user_scope", StringType(), True),
  StructField("group_name", StringType(), True),
  StructField("id", StringType(), True),
  StructField("as_of_date", StringType(), True)
])

df = spark.createDataFrame(app_groups, schema)
df_formatted = df.select("app_id", "app_label", col("id").alias("user_id"), "app_user_scope", "group_name", "as_of_date")

#write to tables
df.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_prod_app_assignments", mode="append") 
df_formatted.write.option("mergeSchema", "true").saveAsTable("users.jack_zaldivar.okta_prod_app_assignments_formatted", mode="append") 

Another One Down!

If you've come this far, then you have successfully written your Users, Groups, Group Members, and Apps tables. We also have relationships between all of the tables in some form! Thanks for sticking around this far!

Let's go play with some data, shall we?