
Pre-Requisites:

Before diving into the workflow scoping process, it is essential to install UCX and run the assessment. The assessment will provide the initial scope of workflows in the workspace.

In addition, ensure that Unity Catalog (UC) is enabled in the workspace, a metastore is attached, and the necessary Databricks system tables are enabled and accessible. These system tables provide the critical operational data required for analyzing jobs and workflows.

(In the following analysis, we use the terms jobs and workflows interchangeably; both refer to Databricks Workflows.)

Below are the pre-requisites and a brief primer on system tables:

Required System Tables

The following system tables must be enabled to gather the necessary data for scoping workflows:

 

 

  1. system.lakeflow.jobs: Contains metadata about all jobs created in the account, including job IDs, names, and workspace associations.
  2. system.billing.list_prices: Provides historical pricing information for various SKUs, enabling cost analysis of workflows.
  3. system.lakeflow.job_run_timeline: Tracks the start and end times of job runs, allowing for analysis of job execution frequency and success rates.
  4. system.lakeflow.job_task_run_timeline: Captures task-level execution details, including compute resource usage.
  5. system.access.audit: Logs cluster-related actions, such as creation or edits, used to identify workflows already running on Unity Catalog-compliant clusters.
  6. system.billing.usage: Aggregates billable usage data across jobs to calculate workflow costs.

 

Follow the instructions here to enable these system schemas. In addition, it is advisable to reach out to your account team to enable the HMS Lineage feature, which is available for a limited time to assist with migration to Unity Catalog.
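Once the schemas are enabled, it is worth running a quick sanity check to confirm the tables above are visible and readable from your workspace before starting the analysis. A minimal check (run from a notebook or the SQL editor) could look like this:

%sql
-- Confirm the required system schemas (lakeflow, billing, access) are enabled and visible
SHOW SCHEMAS IN system;

-- Spot-check that a required table is readable; repeat for the other tables listed above
SELECT * FROM system.lakeflow.jobs LIMIT 5;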

 

Key Steps in Workflow Scope Analysis

1. Identifying Relevant Jobs

The first step is to list all jobs in the workspace that are candidates for migration. Filtering criteria are applied to exclude jobs that are irrelevant or deprecated, such as:

  • Jobs with names containing "views," "test," "New," "Onetime," or "UCX."
  • Jobs marked as deleted or untitled.

This ensures that only active and meaningful workflows are considered for migration. The keywords above are only suggestions; add filters based on your internal naming conventions to exclude non-recurring jobs that should not be part of your migration scope. Additionally, jobs that have not run in over a year (the default retention period of system tables) may require validation using external sources such as the UCX assessment results. If the UCX assessment is complete and jobs that have not run in over a year must still be migrated, run this query against the assessment tables instead of the system tables.



%sql
-- ${WorkspaceID} is a notebook/query parameter identifying the target workspace
SELECT DISTINCT
   job_id,
   workspace_id,
   name
FROM
   system.lakeflow.jobs
WHERE
   workspace_id = ${WorkspaceID}
   AND name != 'Untitled'
   AND delete_time IS NULL
   AND name NOT ILIKE '%views%'
   AND name NOT ILIKE '%test%'
   AND name NOT LIKE '%New %'
   AND name NOT ILIKE '%Onetime%'
   AND name NOT ILIKE '%UCX%'

 

2. Incorporating Job Metadata

To prioritize workflows, metadata such as each job's created date is inferred from its change history in system.lakeflow.jobs. This helps identify legacy workflows that may need special attention during migration.

A temporary view (list_of_jobs) is created to include:

  • Job ID
  • Workspace ID
  • Created date
  • Latest name of the job

This metadata provides a foundation for understanding the lifecycle and relevance of each workflow.



%sql
CREATE OR REPLACE TEMPORARY VIEW list_of_jobs AS (
   WITH created_date AS (
       SELECT DISTINCT
           job_id,
           workspace_id,
           name,
           change_time,
           -- Earliest change_time approximates the job's created date
           FIRST_VALUE(change_time) OVER (PARTITION BY job_id ORDER BY change_time ASC) AS created_date,
           -- Most recent change carries the job's latest name
           FIRST_VALUE(name) OVER (PARTITION BY job_id ORDER BY change_time DESC) AS latest_name
       FROM
           system.lakeflow.jobs
       WHERE
           workspace_id = ${WorkspaceID}
           AND name != 'Untitled'
           AND delete_time IS NULL
           AND name NOT ILIKE '%views%'
           AND name NOT ILIKE '%test%'
           AND name NOT LIKE '%New %'
           AND name NOT ILIKE '%Onetime%'
           AND name NOT ILIKE '%UCX%'
   )
   SELECT DISTINCT
       job_id,
       created_date,
       workspace_id,
       latest_name
   FROM
       created_date
)

 

3. Analyzing Job Run Frequency

The frequency of job runs is a strong indicator of workflow importance. By grouping job runs by quarter, organizations can assess how often a workflow is executed and its temporal significance.

Metrics include:

  • Total number of successful runs.
  • Quarterly breakdown of runs (Q1, Q2, Q3, Q4 etc.).

Jobs with high execution frequency should be prioritized as they likely represent critical business processes.



%sql
CREATE OR REPLACE TEMPORARY VIEW count_of_runs AS (

WITH total_runs AS (
 SELECT
   job_id,
   COUNT(*) AS Total_Runs,
   MIN(period_start_time) AS earliest_execution_time,
   MAX(period_end_time) AS latest_execution_time
 FROM system.lakeflow.job_run_timeline
 WHERE result_state = 'SUCCEEDED'
   AND workspace_id = ${WorkspaceID}
 GROUP BY job_id
),

-- Calendar-quarter buckets (2024 shown; adjust the date ranges to your own timeframe)
q1_runs AS (
 SELECT jrt.job_id, COUNT(*) AS q1_Runs
 FROM system.lakeflow.job_run_timeline jrt
 WHERE result_state = 'SUCCEEDED'
   AND jrt.period_start_time >= '2024-01-01'
   AND jrt.period_start_time < '2024-04-01'
   AND workspace_id = ${WorkspaceID}
 GROUP BY job_id
),

q2_runs AS (
 SELECT jrt.job_id, COUNT(*) AS q2_Runs
 FROM system.lakeflow.job_run_timeline jrt
 WHERE result_state = 'SUCCEEDED'
   AND jrt.period_start_time >= '2024-04-01'
   AND jrt.period_start_time < '2024-07-01'
   AND workspace_id = ${WorkspaceID}
 GROUP BY job_id
),

q3_runs AS (
 SELECT jrt.job_id, COUNT(*) AS q3_Runs
 FROM system.lakeflow.job_run_timeline jrt
 WHERE result_state = 'SUCCEEDED'
   AND jrt.period_start_time >= '2024-07-01'
   AND jrt.period_start_time < '2024-10-01'
   AND workspace_id = ${WorkspaceID}
 GROUP BY job_id
),

q4_runs AS (
 SELECT jrt.job_id, COUNT(*) AS q4_and_2025_Runs
 FROM system.lakeflow.job_run_timeline jrt
 WHERE result_state = 'SUCCEEDED'
   AND jrt.period_start_time >= '2024-10-01'
   AND workspace_id = ${WorkspaceID}
 GROUP BY job_id
)

SELECT t.job_id, Total_Runs, q1_Runs, q2_Runs, q3_Runs, q4_and_2025_Runs, earliest_execution_time, latest_execution_time
FROM total_runs t
LEFT JOIN q1_runs ON t.job_id = q1_runs.job_id
LEFT JOIN q2_runs ON t.job_id = q2_runs.job_id
LEFT JOIN q3_runs ON t.job_id = q3_runs.job_id
LEFT JOIN q4_runs ON t.job_id = q4_runs.job_id
)




Figure 1: Analysis of Workflow Run Frequency
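Note that the query above buckets runs into hardcoded 2024 calendar quarters. If you prefer a breakdown that adapts automatically to whatever history your system tables retain, a minimal alternative sketch using date_trunc over the same table and parameter is shown below; pivot the output per job as needed.

%sql
-- Sketch: dynamic quarterly run counts per job (no hardcoded quarter boundaries)
SELECT
   job_id,
   DATE_TRUNC('QUARTER', period_start_time) AS run_quarter,
   COUNT(*) AS runs_in_quarter
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'SUCCEEDED'
  AND workspace_id = ${WorkspaceID}
GROUP BY job_id, DATE_TRUNC('QUARTER', period_start_time)
ORDER BY job_id, run_quarter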

 

4. Estimating Workflow Costs

Cost analysis aggregates the expenses associated with job runs based on usage data and list prices. This allows organizations to identify high-cost workflows that may benefit most from migration to UC's optimized architecture.

A temporary view (job_cost_analysis) calculates:

  • Total cost per job.
  • Comparative cost metrics for prioritization.

Expensive workflows can be flagged for deeper analysis and optimization during migration.



%sql
CREATE OR REPLACE TEMPORARY VIEW job_cost_analysis AS (
SELECT
 t1.usage_metadata.job_id AS job_id,
 -- List cost = usage quantity x the list price in effect for that SKU at usage time
 SUM(t1.usage_quantity * list_prices.pricing.default) AS list_cost
FROM system.billing.usage t1
 INNER JOIN system.billing.list_prices list_prices
   ON
     t1.cloud = list_prices.cloud AND
     t1.sku_name = list_prices.sku_name AND
     t1.usage_start_time >= list_prices.price_start_time AND
     (t1.usage_end_time <= list_prices.price_end_time OR list_prices.price_end_time IS NULL)
WHERE
 t1.workspace_id = ${WorkspaceID} AND
 t1.sku_name LIKE '%JOBS%' AND
 t1.usage_metadata.job_id IS NOT NULL AND
 t1.usage_metadata.job_run_id IS NOT NULL
GROUP BY job_id
)




Figure 2: Analysis of Workflow List Costs
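Before combining all the views, it can be useful to preview the most expensive jobs on their own, for example:

%sql
-- Preview the 20 most expensive jobs by historical list cost
SELECT job_id, list_cost
FROM job_cost_analysis
ORDER BY list_cost DESC
LIMIT 20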

 

5. Filtering by Job Run Types

Not all job run types are relevant for migration, so we filter the runs down to the types that matter for the migration effort.

 

JOB_RUNs are included as they represent scheduled, persistent jobs that are formally defined in your Databricks workspace. These jobs have permanent definitions stored in the workspace, including schedules, parameters, cluster configurations, and dependencies. They typically represent production-grade, scheduled workflows that are critical to business operations. They appear in the Jobs UI and can be managed through the Jobs API, making them trackable and maintainable resources.

 

SUBMIT_RUNs are excluded as they are one-time, ad-hoc job submissions that don't persist in the Jobs UI. These are created using the jobs/runs/submit API and represent one-time executions rather than recurring workflows.

 

WORKFLOW_RUNs are excluded as they are triggered programmatically from notebooks using dbutils.notebook.run. These runs are embedded within notebook code rather than existing as standalone job definitions. They execute within the context of parent notebooks and don't have independent configurations in the Jobs UI.

 

This ensures focus on recurring workflows that align with UC's governance model.



%sql
CREATE OR REPLACE TEMPORARY VIEW Type_Job_Run AS (
 SELECT
   DISTINCT jrt.job_id
 FROM
   system.lakeflow.job_run_timeline AS jrt
 WHERE
   jrt.period_start_time > '2024-01-01'
   -- Keep JOB_RUN entries (and rows where run_type is not populated)
   AND (jrt.run_type = 'JOB_RUN' OR jrt.run_type IS NULL)
   AND jrt.workspace_id = ${WorkspaceID}
)
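If you want to validate these exclusions before applying the filter, a quick distribution of runs by run type over the same table and time window shows how much of your run history falls into each category:

%sql
-- Distribution of runs by run type in the target workspace
SELECT run_type, COUNT(*) AS run_count
FROM system.lakeflow.job_run_timeline
WHERE workspace_id = ${WorkspaceID}
  AND period_start_time > '2024-01-01'
GROUP BY run_type
ORDER BY run_count DESC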



6. Excluding Already UC-Compliant Workflows

Some workflows may already operate on UC Shared Clusters, making them compliant with UC standards. These jobs can be excluded from migration scope unless specific adjustments are needed.

Key assumptions:

  • If any task within a job uses a shared cluster, the entire workflow is considered UC-compliant.
  • Single-user clusters may also be UC-compliant but require case-by-case evaluation due to potential technical debt (e.g., use of RDDs).

By identifying jobs running on shared clusters, organizations avoid redundant migration efforts and focus resources on non-compliant workflows.



%sql
CREATE OR REPLACE TEMPORARY VIEW uc_job_run_compute AS (
WITH job_run_compute AS (
   -- Map each successful job run to the compute it ran on
   SELECT
       DISTINCT jtrt.job_id,
       COALESCE(jrt.run_type, 'UNKNOWN') AS job_run_type,
       TRY_ELEMENT_AT(jtrt.compute_ids, 1) AS compute_id
   FROM
       system.lakeflow.job_task_run_timeline AS jtrt
   INNER JOIN
       system.lakeflow.job_run_timeline AS jrt
   ON
       jtrt.job_id = jrt.job_id
       AND jtrt.job_run_id = jrt.run_id
   WHERE
       jrt.result_state = 'SUCCEEDED'
       AND jrt.workspace_id = ${WorkspaceID}
       AND jtrt.workspace_id = ${WorkspaceID}
),

cluster_id_to_name AS (
   -- Clusters created or edited with the USER_ISOLATION (UC shared) access mode
   SELECT
       request_params.cluster_name AS cluster_name,
       GET_JSON_OBJECT(response.result, '$.cluster_id') AS cluster_id
   FROM
       system.access.audit
   WHERE
       service_name = 'clusters'
       AND action_name IN ('create', 'edit')
       AND request_params.data_security_mode IN ('USER_ISOLATION')
       AND workspace_id = ${WorkspaceID}
)

SELECT
   DISTINCT jrc.job_id,
   jrc.job_run_type,
   jrc.compute_id,
   cin.cluster_id,
   cin.cluster_name
FROM
   job_run_compute AS jrc
INNER JOIN
   cluster_id_to_name AS cin
ON
   cin.cluster_id = jrc.compute_id
)

 


 

Figure 3: Sample output of workflows that are already running on UC Shared Clusters
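As noted in the assumptions above, jobs running on single-user clusters may also be UC-compliant but deserve case-by-case review. A minimal variant of the cluster lookup CTE that also surfaces these clusters, assuming the same audit log fields, simply widens the data_security_mode filter and carries the mode through for review:

%sql
-- Variant of cluster_id_to_name that also surfaces single-user (dedicated) clusters for review
SELECT
   request_params.cluster_name AS cluster_name,
   request_params.data_security_mode AS data_security_mode,
   GET_JSON_OBJECT(response.result, '$.cluster_id') AS cluster_id
FROM
   system.access.audit
WHERE
   service_name = 'clusters'
   AND action_name IN ('create', 'edit')
   AND request_params.data_security_mode IN ('USER_ISOLATION', 'SINGLE_USER')
   AND workspace_id = ${WorkspaceID}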

 

Bringing It All Together

To finalize the scope of workflows for migration:

  1. Combine filtered job lists with metrics on run frequency, cost analysis, and compliance status.
  2. Exclude jobs already running on shared clusters using an anti-join operation.
  3. Sort workflows by cost, run frequency, and latest execution time to prioritize high-impact migrations.

The resulting dataset provides actionable insights into which workflows should be migrated first, helping stakeholders rationalize scope effectively.



%sql
SELECT
   lj.*,
   jca.*,
   cor.*
FROM
   list_of_jobs lj
INNER JOIN
   Type_Job_Run tjr
   ON lj.job_id = tjr.job_id
LEFT JOIN
   job_cost_analysis jca
   ON lj.job_id = jca.job_id
LEFT JOIN
   count_of_runs cor
   ON lj.job_id = cor.job_id
-- Anti join removes jobs already running on UC shared clusters
LEFT ANTI JOIN
   uc_job_run_compute ujrc
   ON lj.job_id = ujrc.job_id
ORDER BY
   jca.list_cost DESC,
   cor.Total_Runs DESC,
   cor.latest_execution_time DESC

 


Figure 4: Sample output of workflows along with their historic list cost and run details.
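To share the prioritized list with SMEs and stakeholders without rerunning the session-scoped temporary views, the final result can be persisted to a table. A minimal sketch is below; the catalog and schema names (main.migration_scope) are placeholders, so replace them with a location you own.

%sql
-- Persist the prioritized scope for review; catalog/schema names are placeholders
CREATE OR REPLACE TABLE main.migration_scope.workflow_migration_candidates AS
SELECT
   lj.*,
   jca.list_cost,
   cor.Total_Runs,
   cor.latest_execution_time
FROM
   list_of_jobs lj
INNER JOIN Type_Job_Run tjr ON lj.job_id = tjr.job_id
LEFT JOIN job_cost_analysis jca ON lj.job_id = jca.job_id
LEFT JOIN count_of_runs cor ON lj.job_id = cor.job_id
LEFT ANTI JOIN uc_job_run_compute ujrc ON lj.job_id = ujrc.job_id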

 

Next Steps

Use this analysis as a starting point for discussions with subject matter experts (SMEs) and stakeholders. Incorporate additional inputs, such as UCX assessment results and business priorities, to refine the migration plan further. Then use the UCX assessment's workflow code linter to identify the code changes required in each workflow identified as in scope by this exercise.

By systematically scoping workflows using metadata, cost metrics, and run frequency, organizations can ensure a smooth transition from HMS to Unity Catalog while maximizing ROI on their Databricks investment. Import the notebook provided below into your environment to start your own analysis!