4 weeks ago
Hello Community,
We have a case in our project that we would like to solve in an elegant and scalable manner. As always, I would really appreciate your suggestions and experience.
In short:
We have a multi-step job consisting of 4 stages. In one of the stages, the work is split into tasks per business unit. For simplicity, let’s assume there are 500 tasks that must be completed before moving to the next step, where the final data save operation is performed.
For debugging and audit purposes, all tasks are stored in a Delta table with statuses such as IN PROGRESS, READY, COMPLETED, etc. Along with the status, we also store uniqueID and runId so we can identify which tasks belong to a specific job run.
The entire workflow is triggered from our backend API using the Databricks SDK. At the end, we use webhooks to notify whether the job completed successfully or failed.
What we would like to achieve is displaying task progress to the user, for example:
Since we already maintain a Delta table with task statuses, what would be the best way to communicate this progress back to our backend/frontend layer?
Are there any native Databricks mechanisms or recommended patterns for this kind of monitoring/progress reporting?
One important requirement is that we do not want to calculate or expose progress directly within the processing job itself. We would prefer either:
Thanks a lot in advance for your help and recommendations!
3 weeks ago
I don't think there is anything native for this in Databricks. The closest match would have been system tables (system.lakeflow.job_run_timeline / job_task_run_timeline) but I don't think it will have the necessary grain for what your pattern.
There's probably two different ways to try and think about it.
Approach 1:
Approach 2:
you can continue using your existing job-success/fail webhook for the terminal signal. Use approaches above only for the in-flight 200/500 progress updates. That avoids hammering anything when the run is idle.
~Mohan Mathews, Lead DSA.
4 weeks ago
your Delta status table is the right source of truth. I would not rely on Databricks job webhooks for incremental progress; they are mainly for start/success/failure/duration events, not “200/500 completed” style progress.
Pattern:Backend API starts Databricks job->Job writes task-level status to Delta table->Separate lightweight monitor reads Delta by runId->Monitor writes progress snapshot / sends API update->Backend exposes progress to frontend.
Option 1 -Have backend call a query every few seconds
Option 2 — Separate monitoring job
Create a small Databricks job or external service that runs independently from the processing job. It polls the Delta table by runId, calculates progress, and posts updates to your backend endpoint.
Option 3:Delta progress snapshot table
Instead of frontend/backend querying the detailed 500-row status table repeatedly, create a compact table:
run_id
total_tasks
completed_tasks
failed_tasks
in_progress_tasks
ready_tasks
progress_pct
last_updated_ts
Then backend reads only one row per run. This is more scalable and API-friendly.
My recommendation: Option 3 + backend polling. Use the Delta status table as source of truth, maintain a compact progress snapshot table, and let frontend poll your backend endpoint like:
GET /jobs/{runId}/progress
3 weeks ago
I don't think there is anything native for this in Databricks. The closest match would have been system tables (system.lakeflow.job_run_timeline / job_task_run_timeline) but I don't think it will have the necessary grain for what your pattern.
There's probably two different ways to try and think about it.
Approach 1:
Approach 2:
you can continue using your existing job-success/fail webhook for the terminal signal. Use approaches above only for the in-flight 200/500 progress updates. That avoids hammering anything when the run is idle.
~Mohan Mathews, Lead DSA.
2 weeks ago
@MoJaMa thanks a lot for these suggestions!