Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Lakeflow Declarative Pipeline queue

tt_921
New Contributor II

In the January 2026 release notes, it was announced that: "Pipelines now support queued execution mode, where multiple update requests are automatically queued and executed sequentially instead of failing with conflicts. This simplifies operations for pipelines with frequent update triggers and eliminates the need for manual retry coordination."

However, I am still seeing concurrent runs fail with "RUN_EXECUTION_ERROR: Pipeline update already in progress." I also don't see an option to apply this queue setting in the UI, nor any documentation for it in DAB. I tried the job-level `queue: enabled: true` setting in DAB, but this does not work.

Has the pipeline queue been working for anyone else?

1 ACCEPTED SOLUTION

Accepted Solutions

Lu_Wang_ENB_DBX
Databricks Employee

What's going on

  • The January 2026 release note is correct that the engine for Lakeflow Spark Declarative Pipelines supports queued execution, but it's a backend behavior, not a user-configurable option; there is no UI or DAB field to turn it on or off.
  • Internally, there has been a "queuing guardrail" and staged rollout work (e.g., discussion about enabling queued update execution, with that guardrail only now being reverted), so some environments/pipelines still behave as "fail on concurrent StartUpdate" instead of queueing.
  • Separately, there are open or very recent bugs where Jobs sends duplicate StartUpdate requests or loses the response, causing RUN_EXECUTION_ERROR: Pipeline update already in progress even when only one update is actually running; see ES-1635313 and follow-ons, and Nokia BL-16616 / SUP-27441, where the pipeline completes but the job task shows this error.
  • The Jobs queue setting (queue.enabled: true) you tried in DAB is a Jobs-level run queue, not the DLT/Lakeflow pipeline queuing feature, so toggling it won't change the pipeline control-plane behavior.
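
To make the distinction concrete, here is a minimal DAB sketch (resource names such as my_pipeline and my_pipeline_job are placeholders): the queue block is only accepted under a job, and the pipeline resource exposes no equivalent field for queued execution.

  resources:
    pipelines:
      my_pipeline:
        name: my_pipeline
        # ...libraries, catalog, etc.
        # no queue / queued-execution setting can be declared here
    jobs:
      my_pipeline_job:
        name: my_pipeline_job
        queue:
          enabled: true   # Jobs run queue only; does not change how the pipeline handles StartUpdate
        tasks:
          - task_key: run_pipeline
            pipeline_task:
              pipeline_id: ${resources.pipelines.my_pipeline.id}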

Below are three options:


Option 1 – Serialize all updates via one Job (no secondary triggers)

Mitigate by ensuring there is exactly one place that can trigger the pipeline and that its runs never overlap:

  • Use a single Lakeflow Job with one pipeline task, and set either:
    • max_concurrent_runs = 1, or
    • in DAB, on that same job:

      resources:
        jobs:
          my_pipeline_job:
            queue:
              enabled: true

  • Remove/avoid: manual "Start" in the pipeline UI, other Jobs, or ad-hoc API calls that can also call StartUpdate.

This won't fix the known Jobs bug cases where one job run issues duplicate StartUpdate, but it does remove most genuine concurrency conflicts.
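
Putting Option 1 together, a minimal DAB sketch might look like the following (my_pipeline_job, my_pipeline, and the cron schedule are placeholders to adapt to your bundle):

  resources:
    jobs:
      my_pipeline_job:
        name: my_pipeline_job
        max_concurrent_runs: 1    # at most one active run of this job
        queue:
          enabled: true           # extra triggers wait in the Jobs run queue instead of being skipped
        schedule:
          quartz_cron_expression: "0 0/15 * * * ?"   # example: every 15 minutes
          timezone_id: "UTC"
        tasks:
          - task_key: run_pipeline
            pipeline_task:
              pipeline_id: ${resources.pipelines.my_pipeline.id}

With this shape, a trigger that arrives while an update is still running waits in the job queue rather than issuing a second StartUpdate against the pipeline.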


Option 2 – Treat as known Jobs / pipeline integration bug and add retries

If your pattern is "job task sometimes fails with RUN_EXECUTION_ERROR but the underlying pipeline update actually succeeded" (as in ES-1635313 / Nokia cases), treat this as a transient integration bug:

  • In DAB, add a small max_retries on the pipeline task itself so the job auto-retries when it sees this specific error (see the sketch below).
  • Operationally treat these failures as "false negatives" for now (confirm via go/dlt/debug or the pipeline event log that only one update ran and completed).

This doesn't give you true queuing semantics, but it makes the symptom operationally tolerable until the backend rollout and Jobs fixes are fully in place.
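
As a sketch of what the retry knobs look like on the pipeline task (same placeholder job as in Option 1; tune the values to your tolerance). Note that Jobs retries are not scoped to a particular error message, so the task will retry on any failure, which is one more reason to keep max_retries small:

  resources:
    jobs:
      my_pipeline_job:
        tasks:
          - task_key: run_pipeline
            pipeline_task:
              pipeline_id: ${resources.pipelines.my_pipeline.id}
            max_retries: 2                      # a couple of automatic retries is usually enough
            min_retry_interval_millis: 300000   # wait ~5 minutes between attempts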


Option 3 – Escalate via Support/ES to get attached to existing incidents (Recommended)

Given you're still seeing RUN_EXECUTION_ERROR post-announcement and there are active incidents (ES-1635313, BL-16616/SUP-27441) specifically around this error and queued/duplicate StartUpdate behavior:

  • Open a Support ticket (or ES via Support) with:
    • Workspace ID, pipeline ID, and a few failing Job run URLs.
    • Note that this is "Lakeflow Pipelines + Lakeflow Jobs: RUN_EXECUTION_ERROR: Pipeline update already in progress despite queued execution being announced; please check against ES-1635313 / BL-16616 behavior."
  • In parallel, apply Option 1 (single orchestrating job, max_concurrent_runs or job queue) and, if needed, Option 2 (retries) as immediate mitigations.

Recommendation:
Use Option 3 as the primary path so engineering can confirm whether your workspace/pipelines are on the queued-execution rollout and attach you to the ongoing fixes, and in the meantime implement Option 1 (single orchestrator + serialization) plus light retries from Option 2 to reduce operational pain.


2 REPLIES


tt_921
New Contributor II

Thank you very much for the detailed response! Unfortunately we can't proceed with Option 1, as we do require multiple places that can trigger the pipeline (an API call to the parent job, and a direct API call to the pipeline itself). This is due to the specific configurable options available in a pipeline API call vs. a job API call, namely `full_refresh_selection` to fully refresh specific tables.

We do have queue enabled at the job level and a small `max_retries` on the pipeline.

For now it seems we will need to open a ticket and wait until the pipeline execution queue is fully rolled out.