cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Job tasks monitoring

maikel
Contributor III

Hello Community,

We have a case in our project that we would like to solve in an elegant and scalable manner. As always, I would really appreciate your suggestions and experience.

In short:

We have a multi-step job consisting of 4 stages. In one of the stages, the work is split into tasks per business unit. For simplicity, let’s assume there are 500 tasks that must be completed before moving to the next step, where the final data save operation is performed.

For debugging and audit purposes, all tasks are stored in a Delta table with statuses such as IN PROGRESS, READY, COMPLETED, etc. Along with the status, we also store uniqueID and runId so we can identify which tasks belong to a specific job run.

The entire workflow is triggered from our backend API using the Databricks SDK. At the end, we use webhooks to notify whether the job completed successfully or failed.

What we would like to achieve is displaying task progress to the user, for example:

  • 200/500 completed → 40%
  • 250/500 completed → 50%
  • etc.

Since we already maintain a Delta table with task statuses, what would be the best way to communicate this progress back to our backend/frontend layer?

Are there any native Databricks mechanisms or recommended patterns for this kind of monitoring/progress reporting?

One important requirement is that we do not want to calculate or expose progress directly within the processing job itself. We would prefer either:

  • a separate monitoring job/process, or
  • an existing/native Databricks solution if available.

Thanks a lot in advance for your help and recommendations!

1 ACCEPTED SOLUTION

Accepted Solutions

MoJaMa
Databricks Employee
Databricks Employee

I don't think there is anything native for this in Databricks. The closest match would have been system tables (system.lakeflow.job_run_timeline / job_task_run_timeline) but I don't think it will have the necessary grain for what your pattern.

There's probably two different ways to try and think about it.

Approach 1:

  • Enable Change Data Feed on your status Delta table: ALTER TABLE SET TBLPROPERTIES (delta.enableChangeDataFeed = true).
  • Create a Lakebase Postgres instance and a synced table in Continuous mode — it replicates the Delta table to Postgres with a minimum refresh interval of ~15 seconds.
  • Your backend API queries Postgres directly with a plain SELECT COUNT(*) FILTER (WHERE status='COMPLETED'), COUNT(*) FROM tasks WHERE run_id = ? — millisecond latency, no Databricks SQL warehouse spin-up cost per request, and no progress logic in the processing job. Lakebase supports up to 1,000 concurrent connections, so you can poll from the frontend safely.
  • Bonus: same Lakebase instance can back other operational lookups for your app.

Approach 2:

  •  Point your backend at a small serverless SQL warehouse and call the Statement Execution API (https://docs.databricks.com/api/workspace/statementexecution) with a parameterized aggregate query keyed by run_id.
  •  Cache results in your backend for a few seconds to avoid hammering the warehouse.
  •  Trade-off: serverless warehouse cold-starts and per-query latency are higher than Postgres; fine for a progress bar polled every 3–10s, less ideal if you need sub-second updates.

you can continue using your existing job-success/fail webhook for the terminal signal. Use approaches above only for the in-flight 200/500 progress updates. That avoids hammering anything when the run is idle.

~Mohan Mathews, Lead DSA.

View solution in original post

3 REPLIES 3

rdokala
New Contributor III

your Delta status table is the right source of truth. I would not rely on Databricks job webhooks for incremental progress; they are mainly for start/success/failure/duration events, not “200/500 completed” style progress.

Pattern:Backend API starts Databricks job->Job writes task-level status to Delta table->Separate lightweight monitor reads Delta by runId->Monitor writes progress snapshot / sends API update->Backend exposes progress to frontend.

Option 1 -Have backend call a query every few seconds

Option 2 — Separate monitoring job
Create a small Databricks job or external service that runs independently from the processing job. It polls the Delta table by runId, calculates progress, and posts updates to your backend endpoint.

 

Option 3:Delta progress snapshot table

Instead of frontend/backend querying the detailed 500-row status table repeatedly, create a compact table:

run_id
total_tasks
completed_tasks
failed_tasks
in_progress_tasks
ready_tasks
progress_pct
last_updated_ts

Then backend reads only one row per run. This is more scalable and API-friendly.

My recommendation: Option 3 + backend polling. Use the Delta status table as source of truth, maintain a compact progress snapshot table, and let frontend poll your backend endpoint like:

GET /jobs/{runId}/progress

MoJaMa
Databricks Employee
Databricks Employee

I don't think there is anything native for this in Databricks. The closest match would have been system tables (system.lakeflow.job_run_timeline / job_task_run_timeline) but I don't think it will have the necessary grain for what your pattern.

There's probably two different ways to try and think about it.

Approach 1:

  • Enable Change Data Feed on your status Delta table: ALTER TABLE SET TBLPROPERTIES (delta.enableChangeDataFeed = true).
  • Create a Lakebase Postgres instance and a synced table in Continuous mode — it replicates the Delta table to Postgres with a minimum refresh interval of ~15 seconds.
  • Your backend API queries Postgres directly with a plain SELECT COUNT(*) FILTER (WHERE status='COMPLETED'), COUNT(*) FROM tasks WHERE run_id = ? — millisecond latency, no Databricks SQL warehouse spin-up cost per request, and no progress logic in the processing job. Lakebase supports up to 1,000 concurrent connections, so you can poll from the frontend safely.
  • Bonus: same Lakebase instance can back other operational lookups for your app.

Approach 2:

  •  Point your backend at a small serverless SQL warehouse and call the Statement Execution API (https://docs.databricks.com/api/workspace/statementexecution) with a parameterized aggregate query keyed by run_id.
  •  Cache results in your backend for a few seconds to avoid hammering the warehouse.
  •  Trade-off: serverless warehouse cold-starts and per-query latency are higher than Postgres; fine for a progress bar polled every 3–10s, less ideal if you need sub-second updates.

you can continue using your existing job-success/fail webhook for the terminal signal. Use approaches above only for the in-flight 200/500 progress updates. That avoids hammering anything when the run is idle.

~Mohan Mathews, Lead DSA.

maikel
Contributor III

@MoJaMa thanks a lot for these suggestions!