Pre-Requisites:
Before diving into the workflow scoping process, it is essential to install UCX and run its assessment, which provides the initial scope of workflows in the workspace.
In addition, ensure that Unity Catalog (UC) is enabled in the workspace, a metastore is attached, and the necessary Databricks system tables are enabled and accessible. These system tables provide the operational data required for analyzing jobs and workflows.
(In the following analysis, we use the terms jobs and workflows interchangeably; both refer to Databricks Workflows.)
Below are the pre-requisites and a brief primer on system tables:
The following system tables must be enabled to gather the necessary data for scoping workflows:
| # | System Table | Description |
|---|--------------|-------------|
| 1 | system.lakeflow.jobs | Contains metadata about all jobs created in the account, including job IDs, names, and workspace associations. |
| 2 | system.billing.list_prices | Provides historical pricing information for various SKUs, enabling cost analysis of workflows. |
| 3 | system.lakeflow.job_run_timeline | Tracks the start and end times of job runs, allowing for analysis of job execution frequency and success rates. |
| 4 | system.lakeflow.job_task_run_timeline | Captures task-level execution details, including compute resource usage. |
| 5 | system.access.audit | Logs cluster-related actions, such as creation and edits, to identify workflows already running on Unity Catalog-compliant clusters. |
| 6 | system.billing.usage | Aggregates billable usage data across jobs to calculate workflow costs. |
Follow the instructions here to enable these system schemas. In addition, we advise reaching out to your account team to enable the HMS Lineage feature, which is available for a limited time to assist with migration to Unity Catalog.
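Once the schemas are enabled, a quick check like the following (a minimal sketch, not part of the scoping queries themselves) confirms they are visible in your workspace:
%sql
-- Sanity check: list the system schemas visible to you; lakeflow, billing,
-- and access should all appear once they have been enabled.
SHOW SCHEMAS IN system;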
The first step is to list all jobs in the workspace that are candidates for migration. Filtering criteria are applied to exclude jobs that are irrelevant or deprecated, such as:
- Jobs named 'Untitled' and jobs that have already been deleted
- Jobs whose names contain keywords such as 'views', 'test', 'New', 'Onetime', or 'UCX'

This ensures that only active and meaningful workflows are considered for migration. The above keywords are just suggestions; add filters based on any internal naming conventions you use, so that non-recurring jobs you do not want in your migration scope are excluded. Additionally, jobs older than one year (the default retention period of system tables) may require validation using external tools such as the UCX assessment results. If the UCX assessment is complete and jobs that have not run in over a year must be migrated, run this query against the assessment table instead of the system table.
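Note that every query below reads ${WorkspaceID} as a notebook parameter. One way to supply it, sketched here with a placeholder value rather than a real workspace ID, is a text widget; with the parameter in place, the listing query follows.
%sql
-- Create the widget that feeds ${WorkspaceID}; the default below is a
-- placeholder, replace it with your actual workspace ID.
CREATE WIDGET TEXT WorkspaceID DEFAULT '1234567890123456'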
%sql
SELECT DISTINCT
job_id AS job_id,
workspace_id,
name
FROM
system.lakeflow.jobs
WHERE
workspace_id = ${WorkspaceID}
AND name != 'Untitled'
AND delete_time IS NULL
AND name NOT ILIKE '%views%'
AND name NOT ILIKE '%test%'
AND name NOT LIKE '%New %'
AND name NOT ILIKE '%Onetime%'
AND name NOT ILIKE '%UCX%'
To prioritize workflows, metadata such as each job's created date is inferred from the change logs in system.lakeflow.jobs. This helps identify legacy workflows that may need special attention during migration.
A temporary view (list_of_jobs) is created to include:
- The job ID and workspace ID
- The inferred created date (the earliest change_time recorded for the job)
- The latest job name (the name from the most recent change record)

This metadata provides a foundation for understanding the lifecycle and relevance of each workflow.
%sql
CREATE OR REPLACE TEMPORARY VIEW list_of_jobs AS (
WITH created_date AS (
SELECT DISTINCT
job_id AS job_id,
workspace_id,
name,
change_time,
FIRST_VALUE(change_time) OVER (PARTITION BY job_id ORDER BY change_time ASC) AS created_date,
FIRST_VALUE(name) OVER (PARTITION BY job_id ORDER BY change_time DESC) AS latest_name
FROM
system.lakeflow.jobs
WHERE
workspace_id = ${WorkspaceID}
AND name != 'Untitled'
AND delete_time IS NULL
AND name NOT ILIKE '%views%'
AND name NOT ILIKE '%test%'
AND name NOT LIKE '%New %'
AND name NOT ILIKE '%Onetime%'
AND name NOT ILIKE '%UCX%'
)
SELECT DISTINCT
job_id,
created_date,
workspace_id,
latest_name
FROM
created_date
)
The frequency of job runs is a strong indicator of workflow importance. By grouping job runs by quarter, organizations can assess how often a workflow is executed and its temporal significance.
Metrics include:
- The total number of successful runs per job
- Successful runs per quarter (Q1 through Q3 of 2024, plus a bucket for Q4 2024 onward)
- The earliest and latest successful execution times for each job

Jobs with high execution frequency should be prioritized, as they likely represent critical business processes. Note that the quarter boundaries in the query below are hardcoded for 2024; adjust them to match your own analysis window.
%sql
CREATE OR REPLACE TEMPORARY VIEW count_of_runs AS (
WITH total_runs AS (
SELECT job_id, COUNT(*) AS Total_Runs, min(period_start_time) as earliest_execution_time , max(period_end_time) as latest_execution_time
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'SUCCEEDED'
AND workspace_id = ${WorkspaceID}
GROUP BY job_id
),
q1_runs AS (
SELECT jrt.job_id, COUNT(*) AS q1_Runs
FROM system.lakeflow.job_run_timeline jrt
WHERE result_state = 'SUCCEEDED'
AND jrt.period_start_time >= '2024-01-01'
AND jrt.period_start_time < '2024-04-01'
AND workspace_id = ${WorkspaceID}
GROUP BY job_id
),
q2_runs AS (
SELECT jrt.job_id, COUNT(*) AS q2_Runs
FROM system.lakeflow.job_run_timeline jrt
WHERE result_state = 'SUCCEEDED'
AND jrt.period_start_time >= '2024-04-01'
AND jrt.period_start_time < '2024-07-01'
AND workspace_id = ${WorkspaceID}
GROUP BY job_id
),
q3_runs AS (
SELECT jrt.job_id, COUNT(*) AS q3_Runs
FROM system.lakeflow.job_run_timeline jrt
WHERE result_state = 'SUCCEEDED'
AND jrt.period_start_time >= '2024-07-01'
AND jrt.period_start_time < '2024-10-01'
AND workspace_id = ${WorkspaceID}
GROUP BY job_id
),
q4_runs AS (
SELECT jrt.job_id, COUNT(*) AS q4_and_2025_Runs
FROM system.lakeflow.job_run_timeline jrt
WHERE result_state = 'SUCCEEDED'
AND jrt.period_start_time >= '2024-10-01'
AND workspace_id = ${WorkspaceID}
GROUP BY job_id
)
SELECT t.job_id, Total_Runs, q1_Runs, q2_Runs, q3_Runs, q4_and_2025_Runs, earliest_execution_time, latest_execution_time
FROM total_runs t
LEFT JOIN q1_runs ON t.job_id = q1_runs.job_id
LEFT JOIN q2_runs ON t.job_id = q2_runs.job_id
LEFT JOIN q3_runs ON t.job_id = q3_runs.job_id
LEFT JOIN q4_runs ON t.job_id = q4_runs.job_id
)
Figure 1: Analysis of Workflow Run Frequency
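With the view in place, a simple query such as this sketch surfaces the most frequently executed jobs, which are natural candidates for early migration waves:
%sql
-- Example usage: rank jobs by total successful runs.
SELECT job_id, Total_Runs, earliest_execution_time, latest_execution_time
FROM count_of_runs
ORDER BY Total_Runs DESC
LIMIT 10;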
Cost analysis aggregates the expenses associated with job runs based on usage data and list prices. This allows organizations to identify high-cost workflows that may benefit most from migration to UC's optimized architecture.
A temporary view (job_cost_analysis) calculates:
- The total list cost per job, computed as usage quantity multiplied by the applicable list price (matched on cloud, SKU, and the price's validity window)
- Costs restricted to jobs compute SKUs, attributed to jobs via usage_metadata.job_id

Expensive workflows can be flagged for deeper analysis and optimization during migration.
%sql
CREATE OR REPLACE TEMPORARY VIEW job_cost_analysis AS (
SELECT
t1.usage_metadata.job_id as job_id,
SUM(t1.usage_quantity * list_prices.pricing.default) AS list_cost
FROM system.billing.usage t1
INNER JOIN system.billing.list_prices list_prices
ON
t1.cloud = list_prices.cloud AND
t1.sku_name = list_prices.sku_name AND
t1.usage_start_time >= list_prices.price_start_time AND
(t1.usage_end_time <= list_prices.price_end_time or list_prices.price_end_time is NULL)
WHERE
t1.workspace_id = ${WorkspaceID} AND
t1.sku_name LIKE '%JOBS%' AND
t1.usage_metadata.job_id IS NOT NULL AND
t1.usage_metadata.job_run_id IS NOT NULL
GROUP BY job_id
)
Figure 2: Analysis of Workflow List Costs
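As a quick follow-up, a sketch like the following ranks the costliest workflows so they can be flagged for deeper review:
%sql
-- Example usage: rank jobs by accumulated list cost.
SELECT job_id, ROUND(list_cost, 2) AS list_cost
FROM job_cost_analysis
ORDER BY list_cost DESC
LIMIT 10;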
Not all job run types are relevant for migration, so we filter the runs down to the types that matter for the migration effort.
JOB_RUNs are included as they represent scheduled, persistent jobs that are formally defined in your Databricks workspace. These jobs have permanent definitions stored in the workspace, including schedules, parameters, cluster configurations, and dependencies. They typically represent production-grade, scheduled workflows that are critical to business operations. They appear in the Jobs UI and can be managed through the Jobs API, making them trackable and maintainable resources.
SUBMIT_RUNs are excluded as they are one-time, ad-hoc job submissions that don't persist in the Jobs UI. These are created using the jobs/runs/submit API and represent one-time executions rather than recurring workflows.
WORKFLOW_RUNs are excluded as they are triggered programmatically from notebooks using dbutils.notebook.run. These runs are embedded within notebook code rather than existing as standalone job definitions. They execute within the context of parent notebooks and don't have independent configurations in the Jobs UI.
This ensures focus on recurring workflows that align with UC's governance model.
%sql
CREATE OR REPLACE TEMPORARY VIEW Type_Job_Run AS (
SELECT
DISTINCT jrt.job_id
FROM
system.lakeflow.job_run_timeline AS jrt
WHERE
jrt.period_start_time > '2024-01-01'
AND (jrt.run_type = 'JOB_RUN' OR jrt.run_type IS NULL)
AND jrt.workspace_id = ${WorkspaceID}
)
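Before applying this filter, it can be useful to see how run types are distributed in your workspace. The following optional diagnostic (a sketch using the same table and parameter as above) shows how many SUBMIT_RUN and WORKFLOW_RUN executions would be excluded:
%sql
-- Optional diagnostic: distribution of run types in this workspace.
SELECT run_type, COUNT(*) AS run_count
FROM system.lakeflow.job_run_timeline
WHERE workspace_id = ${WorkspaceID}
  AND period_start_time > '2024-01-01'
GROUP BY run_type
ORDER BY run_count DESC;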
Some workflows may already operate on UC Shared Clusters, making them compliant with UC standards. These jobs can be excluded from migration scope unless specific adjustments are needed.
Key assumptions:
- The first compute ID recorded for a task run represents the cluster the job ran on.
- Clusters created or edited with data_security_mode set to USER_ISOLATION (UC Shared access mode) are considered UC-compliant.

By identifying jobs running on shared clusters, organizations avoid redundant migration effort and can focus resources on non-compliant workflows.
%sql
CREATE OR REPLACE TEMPORARY VIEW uc_job_run_compute AS (
WITH job_run_compute AS (
SELECT
DISTINCT jtrt.job_id,
COALESCE(jrt.run_type, 'UNKNOWN') AS job_run_type,
TRY_ELEMENT_AT(jtrt.compute_ids, 1) AS compute_id
FROM
system.lakeflow.job_task_run_timeline AS jtrt
INNER JOIN
system.lakeflow.job_run_timeline AS jrt
ON
jtrt.job_id = jrt.job_id
AND jtrt.job_run_id = jrt.run_id
WHERE
jrt.result_state = 'SUCCEEDED'
AND jrt.workspace_id = ${WorkspaceID}
AND jtrt.workspace_id = ${WorkspaceID}
),
cluster_id_to_name AS (
SELECT
request_params.cluster_name AS cluster_name,
GET_JSON_OBJECT(response.result, '$.cluster_id') AS cluster_id
FROM
system.access.audit
WHERE
service_name = 'clusters'
AND action_name IN ('create', 'edit')
AND request_params.data_security_mode IN ('USER_ISOLATION')
AND workspace_id = ${WorkspaceID}
)
SELECT
DISTINCT jrc.job_id,
jrc.job_run_type,
jrc.compute_id,
cin.cluster_id,
cin.cluster_name
FROM
job_run_compute AS jrc
INNER JOIN
cluster_id_to_name AS cin
ON
cin.cluster_id = jrc.compute_id
)
Figure 3: Sample output of workflows that are already running on UC Shared Clusters
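A quick count over the view, sketched below, quantifies how many jobs can likely be dropped from the migration scope:
%sql
-- Example usage: how many jobs already run on UC Shared Clusters?
SELECT COUNT(DISTINCT job_id) AS uc_compliant_jobs
FROM uc_job_run_compute;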
To finalize the scope of workflows for migration:
- Join the list of candidate jobs with their run counts and list costs.
- Keep only persistent jobs (run type JOB_RUN).
- Exclude jobs already running on UC Shared Clusters (via the anti join).
- Order the results by list cost, run frequency, and recency of execution.

The resulting dataset provides actionable insight into which workflows should be migrated first, helping stakeholders rationalize scope effectively.
%sql
SELECT
lj.*,
jca.*,
cor.*
FROM
list_of_jobs lj
INNER JOIN
Type_Job_Run tjr
ON lj.job_id = tjr.job_id
LEFT JOIN
job_cost_analysis jca
ON lj.job_id = jca.job_id
LEFT JOIN
count_of_runs cor
ON lj.job_id = cor.job_id
LEFT ANTI JOIN
uc_job_run_compute ujrc
ON lj.job_id = ujrc.job_id
ORDER BY
jca.list_cost DESC,
cor.Total_Runs DESC,
cor.latest_execution_time DESC
Figure 4: Sample output of workflows along with their historic list cost and run details.
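Because the views above are temporary, the scoping output disappears with the notebook session. A sketch like the following persists the result as a table for stakeholder review; main.migration_scoping is a hypothetical location you would replace with a catalog and schema you own, and columns are named explicitly to avoid duplicate job_id columns in the created table.
%sql
-- Persist the scoping output; main.migration_scoping is a hypothetical
-- catalog.schema, replace it with one that exists in your metastore.
CREATE OR REPLACE TABLE main.migration_scoping.workflow_scope AS
SELECT
  lj.job_id,
  lj.workspace_id,
  lj.latest_name,
  lj.created_date,
  jca.list_cost,
  cor.Total_Runs,
  cor.latest_execution_time
FROM list_of_jobs lj
INNER JOIN Type_Job_Run tjr ON lj.job_id = tjr.job_id
LEFT JOIN job_cost_analysis jca ON lj.job_id = jca.job_id
LEFT JOIN count_of_runs cor ON lj.job_id = cor.job_id
LEFT ANTI JOIN uc_job_run_compute ujrc ON lj.job_id = ujrc.job_id;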
Use this analysis as a starting point for discussions with subject matter experts (SMEs) and stakeholders. Incorporate additional inputs, such as UCX assessment results and business priorities, to refine the migration plan further. Then use the UCX assessment's workflow code linter to identify the code changes required within the workflows identified as in scope by the above exercise.
By systematically scoping workflows using metadata, cost metrics, and run frequency, organizations can ensure a smooth transition from HMS to Unity Catalog while maximizing the ROI on their Databricks investment. Import the notebook provided below into your environment to start your own analysis!