Community Articles

Streaming Failure Models: Why "It Didn't Crash" Is the Worst Outcome

Kirankumarbs
Contributor

Most Databricks streaming failures don't look dramatic.

No cluster termination. No red wall of errors. The UI says RUNNING — and your customers start reporting nonsense.

I wrote about the incident that changed how we think about streaming jobs on shared clusters:

- Why query-scoped failures are more dangerous than driver-scoped ones
- How query.awaitTermination() on each stream individually caused us to miss a silent failure for 12 minutes
- Why continuous jobs don't save you if the JVM never fails in the first place
- The one-line fix (awaitAnyTermination) that stopped the lying — and why it's still a band-aid
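The failure model behind the second and fourth bullets can be sketched without a cluster. The snippet below is a minimal simulation, not Spark code: `FakeQuery` and `await_any` are hypothetical stand-ins for `StreamingQuery.awaitTermination()` and `spark.streams.awaitAnyTermination()`, showing why blocking on one healthy stream hides a silent failure in another.

```python
import threading
import time

class FakeQuery:
    """Illustrative stand-in for a Spark StreamingQuery (not the real API)."""
    def __init__(self, name, fails_after=None):
        self.name = name
        self._done = threading.Event()
        self.exception = None
        if fails_after is not None:
            # Simulate a query-scoped failure: the query dies quietly
            # while the driver JVM keeps running.
            t = threading.Timer(fails_after, self._fail)
            t.daemon = True
            t.start()

    def _fail(self):
        self.exception = RuntimeError(f"{self.name} failed silently")
        self._done.set()

    def await_termination(self, timeout=None):
        # Blocks until THIS query terminates, like query.awaitTermination().
        return self._done.wait(timeout)

def await_any(queries, timeout):
    # Behaves like spark.streams.awaitAnyTermination():
    # returns as soon as ANY query stops, healthy or not.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for q in queries:
            if q._done.is_set():
                return q
        time.sleep(0.01)
    return None

healthy = FakeQuery("orders")                       # never terminates
broken = FakeQuery("payments", fails_after=0.1)     # dies quietly

# Anti-pattern: awaiting the healthy query first blocks until the
# timeout; the payments failure goes completely unnoticed.
noticed = healthy.await_termination(timeout=0.5)
print("anti-pattern noticed a failure:", noticed)   # False

# The one-line-fix pattern: await ANY termination, so the failed
# query surfaces immediately instead of being masked.
failed = await_any([healthy, broken], timeout=0.5)
print("awaitAnyTermination surfaced:", failed.name, failed.exception)
```

In real Spark code the fix is the same shape: replace the per-query `awaitTermination()` loop with a single `spark.streams.awaitAnyTermination()`, then inspect which query stopped and why.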

Full post available on Medium or the blog.

Part 2 (multi-task on a shared cluster — why that's also not enough) coming soon.

I am always happy to share and learn about production insights!

 

9 REPLIES

wesleyfelipe
Contributor

@Kirankumarbs  This is really great content!

Streaming monitoring has always been challenging.
I'm planning to write about a similar situation I faced a few years ago.

I'm looking forward to part 2 of your series.

Kirankumarbs
Contributor

Thanks! I am glad that you liked it!

Indeed, streaming constructs, unit/integration testing, and monitoring are much more involved than simple batch jobs!

I am already writing the second part and am excited to share it, probably on March 5th!

Kirankumarbs
Contributor

Part 2 is complete as well! Multi-Task on a Shared Cluster — Why That's Also Not Enough

An interesting read!

Thanks for reading and Happy to Learn/Share!

Kirankumarbs
Contributor

There we go, Part 3 is also available!

Thanks for the encouragement, and I'm glad to write and share such production insights!

wesleyfelipe
Contributor

@kiran

I really enjoy reading these kinds of real-world problem cases. I like how practical and grounded your articles are. Sometimes having a solution that solves the problem now is more valuable than following the perfect best practice, especially when you need results quickly.

Congrats on the series!

Kirankumarbs
Contributor

Exactly, @wesleyfelipe! Solutions should be good enough now and improve organically as needed!

mderela
Contributor

Good series. The query-scoped vs driver-scoped framing from Part 1 is something I haven’t seen written down clearly before, even though everyone who’s run streaming in prod has hit it.
One thing that kept nagging me reading all three parts: Serverless Jobs never comes up. That’s the obvious answer to “cost is why we haven’t switched.” Per-task isolation, no cluster lifecycle to manage, no cold start tax. What was the reason it was off the table?
Also the ConcurrentAppendException mention at the end of Part 3 is the thing I most want to read about. That’s not a retry problem, that’s Delta isolation levels and isBlindAppend semantics inside foreachBatch. Different beast entirely.

Kirankumarbs
Contributor

First of all, thanks a lot for reading and for the encouraging words!

I really enjoy sharing the production journey I'm going through in the form of blogs and articles, because this kind of material is hard to find in books or docs!

If you're curious about this kind of production guide, I've started writing an e-book here: https://kirankbs.com/ebook/ feel free to check it out!

Thanks!

mderela
Contributor

Completely agree, production war stories are worth more than any documentation. I've cut my teeth on enough production data lake issues to write my own chapter on what can go wrong, whether that's deploying Databricks in financial institutions or being one of the first to do a full SIEM replacement in production. Some of those scars ended up on my blog: https://dere.la