<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Multi-Task on a Shared Cluster — Why That's Also Not Enough in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/multi-task-on-a-shared-cluster-why-that-s-also-not-enough/m-p/150676#M1070</link>
    <description>&lt;P&gt;Part 1:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://community.databricks.com/t5/community-articles/streaming-failure-models-why-quot-it-didn-t-crash-quot-is-the/td-p/149640" target="_blank"&gt;&lt;SPAN&gt;Streaming Failure Models: Why "It Didn't Crash" Is the Worst Outcome&lt;/SPAN&gt;&lt;/A&gt;&lt;BR /&gt;Part 3:&amp;nbsp;&lt;A href="https://community.databricks.com/t5/community-articles/one-cluster-per-task-proven-ready-and-waiting/td-p/150592" target="_blank"&gt;&lt;SPAN&gt;One Cluster per Task — Proven, Ready, and Waiting&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 12 Mar 2026 11:12:04 GMT</pubDate>
    <dc:creator>Kirankumarbs</dc:creator>
    <dc:date>2026-03-12T11:12:04Z</dc:date>
    <item>
      <title>Multi-Task on a Shared Cluster — Why That's Also Not Enough</title>
      <link>https://community.databricks.com/t5/community-articles/multi-task-on-a-shared-cluster-why-that-s-also-not-enough/m-p/149937#M1055</link>
      <description>&lt;P&gt;&lt;EM&gt;Part 2 of 3 — Databricks Streaming Architecture&lt;/EM&gt;&lt;/P&gt;&lt;HR /&gt;&lt;P&gt;The instinct after &lt;A title="Streaming Failure Models: Why &amp;quot;It Didn't Crash&amp;quot; Is the Worst Outcome" href="https://community.databricks.com/t5/community-articles/streaming-failure-models-why-quot-it-didn-t-crash-quot-is-the/td-p/149640" target="_self"&gt;Part 1&lt;/A&gt; was obvious.&lt;/P&gt;&lt;P&gt;If running eight queries in one task means one failure can hide while others keep running — split them into multiple tasks. Separate concerns. Give each component its own retry boundary.&lt;/P&gt;&lt;P&gt;Right instinct. Wrong infrastructure assumption.&lt;/P&gt;&lt;H2 id="we-tried-it"&gt;We tried it&lt;/H2&gt;&lt;P&gt;While the multi-query incident from Part 1 was still fresh, we were already experimenting with a multi-task approach on a separate workflow. Two tasks, same shared job cluster:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Task 1&lt;/STRONG&gt;: feature extraction — processing sensor data into feature tables&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Task 2&lt;/STRONG&gt;: inference — ML model outputs written to downstream Delta tables&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Sequential dependency. Task 2 reads what Task 1 writes. Clean separation on paper.&lt;/P&gt;&lt;P&gt;Then Task 2 hit a wall.&lt;/P&gt;&lt;H2 id="the-incident--external-location-mismatch"&gt;The incident — external location mismatch&lt;/H2&gt;&lt;P&gt;Task 2 was writing to a Delta table registered in Unity Catalog. The catalog entry pointed to external location A. The actual data sat at location B.&lt;/P&gt;&lt;P&gt;A misconfiguration. Easy to make during migration, hard to spot before it fails in production.&lt;/P&gt;&lt;P&gt;Task 2 failed. Task 1 kept running.&lt;/P&gt;&lt;P&gt;And here’s where it felt familiar: the job didn’t fail. No restart triggered. One task retrying. The other healthy. The UI said RUNNING.&lt;/P&gt;&lt;P&gt;Same story as Part 1. Different packaging.&lt;/P&gt;
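&lt;P&gt;In hindsight, a pre-flight check at task start would have caught it. A minimal sketch, assuming the ambient spark session Databricks provides; the table and path names are placeholders, not the real ones from the incident:&lt;/P&gt;&lt;PRE&gt;# Hypothetical pre-flight check: does the Unity Catalog registration
# match the path the pipeline actually writes to? Assumes the ambient
# spark session. All names below are placeholders.
table = "main.sensors.inference_output"                         # hypothetical UC table
expected = "abfss://lake@store.dfs.core.windows.net/inference"  # hypothetical path

# DESCRIBE DETAIL on a Delta table reports the storage location it resolves to
registered = spark.sql(f"DESCRIBE DETAIL {table}").first()["location"]

if registered.rstrip("/") != expected.rstrip("/"):
    raise ValueError(f"Location mismatch: {registered} vs {expected}")&lt;/PRE&gt;&lt;P&gt;Run before the first write, a check like this turns a silent retry loop into an immediate, explicit failure.&lt;/P&gt;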

&lt;H2 id="the-detail-that-changes-everything-theres-still-one-driver"&gt;The detail that changes everything: there’s still one driver&lt;/H2&gt;&lt;P&gt;Here’s what multi-task on a shared cluster actually looks like at runtime:&lt;/P&gt;&lt;PRE&gt;Multi-Task on a Shared Job Cluster

Task 1 (Python Process A)     Task 2 (Python Process B)
          \                           /
           \                         /
            ┌────────────────────────┐
            │      Spark Driver      │
            │         JVM            │
            │    (shared by all)     │
            └────────────────────────┘
                       │
                  Executors&lt;/PRE&gt;&lt;P&gt;Multiple Python processes. One Spark driver JVM.&lt;/P&gt;
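&lt;P&gt;You can see this from inside the tasks themselves. A small probe, assuming standard shared-job-cluster behaviour and the ambient spark session:&lt;/P&gt;&lt;PRE&gt;# Run at the top of each task on the shared job cluster. If the claim
# above holds, the Python PID differs per task while the Spark
# application ID is identical: separate processes, one shared driver.
import os

print(f"Python PID:   {os.getpid()}")
print(f"Spark app ID: {spark.sparkContext.applicationId}")&lt;/PRE&gt;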

&lt;P&gt;Compare that to the multi-query single task from Part 1:&lt;/P&gt;&lt;PRE&gt;Multi-Query Single Task

     Python Process (single)
              │
     ┌────────────────────┐
     │    Spark Driver    │
     │        JVM         │
     │  Q1  Q2  Q3 ... Q8 │
     └────────────────────┘
              │
         Executors&lt;/PRE&gt;&lt;P&gt;The difference between these two diagrams is smaller than it looks. Both share the same driver. Both share the same executors. Multi-task adds Python process separation — but that’s not where streaming failures originate. Streaming failures live in the JVM, in the query scheduler, in the Delta transaction layer. All of which is still shared.&lt;/P&gt;&lt;H2 id="what-multi-task-actually-adds-on-a-shared-cluster"&gt;What multi-task actually adds on a shared cluster&lt;/H2&gt;&lt;P&gt;Splitting into tasks on a shared cluster gives you:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Multiple Python processes on the same driver node&lt;/LI&gt;&lt;LI&gt;Multiple SparkSession lifecycles, each with its own initialisation overhead&lt;/LI&gt;&lt;LI&gt;More listeners, more logging, more scheduler registration&lt;/LI&gt;&lt;LI&gt;Concurrent memory pressure when tasks run in parallel&lt;/LI&gt;&lt;LI&gt;The risk that one failing task, retrying repeatedly, destabilises the cluster for every other task&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;You get the operational complexity of multiple processes without the isolation you were looking for.&lt;/P&gt;&lt;H2 id="the-fix--and-why-it-works-here"&gt;The fix — and why it works here&lt;/H2&gt;&lt;P&gt;For the external location incident, we added task-level retry configuration: three retries per task on the continuous job. Once exhausted, Databricks restarts the entire job.&lt;/P&gt;
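&lt;P&gt;Sketched as a Jobs API 2.1 payload, based on the description above; every name and the cluster spec are illustrative, not our real job:&lt;/P&gt;&lt;PRE&gt;# Illustrative job settings as a Python dict for the Jobs API 2.1.
# The job name, task keys, notebook paths, and cluster spec are placeholders.
job_settings = {
    "name": "sensor-pipeline",                   # hypothetical
    "continuous": {"pause_status": "UNPAUSED"},  # continuous job
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {                         # illustrative spec
            "spark_version": "15.4.x-scala2.12",
            "node_type_id": "Standard_D4ds_v5",
            "num_workers": 4,
        },
    }],
    "tasks": [
        {
            "task_key": "feature_extraction",
            "job_cluster_key": "shared",         # same shared cluster
            "max_retries": 3,                    # three retries per task
            "notebook_task": {"notebook_path": "/pipelines/feature_extraction"},
        },
        {
            "task_key": "inference",
            "depends_on": [{"task_key": "feature_extraction"}],  # sequential
            "job_cluster_key": "shared",
            "max_retries": 3,
            "notebook_task": {"notebook_path": "/pipelines/inference"},
        },
    ],
}&lt;/PRE&gt;&lt;P&gt;The depends_on edge is what keeps the retry story clean: the two tasks never contend for the driver at the same time.&lt;/P&gt;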
&lt;P&gt;It works. And it’s a better failure story than Part 1 — Task 2 eventually fails loudly and triggers a restart rather than running silently while Task 1 keeps writing data that nobody will ever process.&lt;/P&gt;&lt;P&gt;But here’s the key distinction: &lt;STRONG&gt;it works because Task 1 and Task 2 are sequential.&lt;/STRONG&gt; Task 2 depends on Task 1. They don’t run simultaneously. No concurrent driver contention. Failure propagates cleanly up the chain.&lt;/P&gt;&lt;P&gt;Multi-task on a shared cluster is a reasonable pattern for sequential batch ETL. Feature extraction feeds inference. Inference feeds output. Tasks chain, failures surface, retries make sense.&lt;/P&gt;&lt;P&gt;The problem is assuming the same pattern works for parallel long-running streaming. That’s where the shared driver becomes a liability instead of a trade-off.&lt;/P&gt;&lt;H2 id="the-rule-we-wrote-down"&gt;The rule we wrote down&lt;/H2&gt;&lt;P&gt;After both incidents, this became our working principle:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;Multi-task on a shared cluster: right for sequential batch ETL, wrong for parallel streaming.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;The difference is contention. Sequential tasks don’t compete for the driver simultaneously. Parallel streaming queries do — continuously, for the lifetime of the job.&lt;/P&gt;&lt;P&gt;If you’re running parallel streaming on a shared cluster, a multi-query single task with awaitAnyTermination (Part 1) gives you a cleaner failure boundary than splitting into tasks.&lt;/P&gt;&lt;P&gt;If you’re running sequential batch ETL, multi-task with task-level retry is a legitimate approach within the budget constraints of a shared cluster.&lt;/P&gt;&lt;H2 id="but-this-still-isnt-the-real-answer"&gt;But this still isn’t the real answer&lt;/H2&gt;&lt;P&gt;Both fixes share the same problem.&lt;/P&gt;&lt;P&gt;awaitAnyTermination in Part 1 makes query failures loud. Task retry in Part 2 makes task failures recoverable. Neither prevents a failure in one component from affecting the shared driver — and everything attached to it.&lt;/P&gt;&lt;P&gt;The real answer is what we’d resisted for months: one cluster per task. With that isolation, a failure in the inference pipeline cannot, by construction, affect the ingestion pipeline.&lt;/P&gt;&lt;P&gt;That’s Part 3 — when we made the architectural change, what it cost, and what got better overnight.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;→ Part 3: One Cluster per Task — What Real Isolation Actually Looks Like&lt;BR /&gt;&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 05 Mar 2026 21:19:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/multi-task-on-a-shared-cluster-why-that-s-also-not-enough/m-p/149937#M1055</guid>
      <dc:creator>Kirankumarbs</dc:creator>
      <dc:date>2026-03-05T21:19:29Z</dc:date>
    </item>
    <item>
      <title>Re: Multi-Task on a Shared Cluster — Why That's Also Not Enough</title>
      <link>https://community.databricks.com/t5/community-articles/multi-task-on-a-shared-cluster-why-that-s-also-not-enough/m-p/150676#M1070</link>
      <description>&lt;P&gt;Part 1:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://community.databricks.com/t5/community-articles/streaming-failure-models-why-quot-it-didn-t-crash-quot-is-the/td-p/149640" target="_blank"&gt;&lt;SPAN&gt;Streaming Failure Models: Why "It Didn't Crash" Is the Worst Outcome&lt;/SPAN&gt;&lt;/A&gt;&lt;BR /&gt;Part 3:&amp;nbsp;&lt;A href="https://community.databricks.com/t5/community-articles/one-cluster-per-task-proven-ready-and-waiting/td-p/150592" target="_blank"&gt;&lt;SPAN&gt;One Cluster per Task — Proven, Ready, and Waiting&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 12 Mar 2026 11:12:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/multi-task-on-a-shared-cluster-why-that-s-also-not-enough/m-p/150676#M1070</guid>
      <dc:creator>Kirankumarbs</dc:creator>
      <dc:date>2026-03-12T11:12:04Z</dc:date>
    </item>
  </channel>
</rss>

