03-02-2026 11:53 AM - edited 03-02-2026 12:18 PM
Most Databricks streaming failures don't look dramatic.
No cluster termination. No red wall of errors. The UI says RUNNING, and your customers start reporting nonsense.
I wrote about the incident that changed how we think about streaming jobs on shared clusters:
- Why query-scoped failures are more dangerous than driver-scoped ones
- How query.awaitTermination() on each stream individually caused us to miss a silent failure for 12 minutes
- Why continuous jobs don't save you if the JVM never fails in the first place
- The one-line fix (awaitAnyTermination) that stopped the lying, and why it's still a band-aid
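To make the awaitTermination vs. awaitAnyTermination point concrete, here is a minimal sketch in Python. It is not the code from the article. It assumes PySpark-style objects: a manager such as `spark.streams` that exposes `awaitAnyTermination()`, and queries that expose `isActive` and `stop()`. The helper name `fail_fast` is hypothetical.

```python
# Sketch only: fail_fast is a hypothetical helper, not code from the article.
#
# The risky pattern described in the bullets looks roughly like this:
#   q1.awaitTermination()   # blocks while q1 is healthy
#   q2.awaitTermination()   # not reached while q1 runs, so a q2 failure is missed


def fail_fast(streams, queries):
    """Block until any query stops, stop the rest, then raise.

    `streams` is assumed to behave like `spark.streams` and `queries`
    like a list of StreamingQuery objects.
    """
    try:
        # Returns when any query stops; raises if that query failed.
        streams.awaitAnyTermination()
    finally:
        # Stop the surviving queries so the job cannot keep showing RUNNING
        # with only part of its streams alive.
        for q in queries:
            if q.isActive:
                q.stop()
    # A clean stop of one query still ends the whole job on purpose.
    raise RuntimeError("a streaming query stopped; failing the job run")
```

In a real job this would be called as `fail_fast(spark.streams, [q1, q2])` after both queries are started. Raising on purpose lets the job's retry policy restart everything, which matches the post's point that this is a band-aid rather than real isolation.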
Part 2 (multi-task on a shared cluster: why that's also not enough) is coming soon.
I am always happy to share and learn about production insights!
03-03-2026 12:18 AM
@Kirankumarbs This is really great content!
Streaming monitoring has always been challenging.
I'm planning to write about a similar situation I faced a few years ago too.
I'm looking forward to Part 2 of your series.
03-03-2026 04:59 AM
Thanks, I am glad that you liked it!
Indeed, streaming constructs, unit/integration testing, and monitoring are much more involved and complicated than simple batch jobs!
I am already writing the second part and am excited to share it, probably on March 5th!
03-05-2026 01:21 PM
I completed Part 2 as well! Multi-Task on a Shared Cluster: Why That's Also Not Enough
An interesting read!
Thanks for reading, and happy to learn and share!
03-12-2026 04:10 AM
There we go, Part 3 is also available!
Thanks for the encouragement, and I'm glad to write and share such production insights!
03-12-2026 05:02 AM
I really enjoy reading these kinds of real-world problem cases. I like how practical and grounded your articles are. Sometimes having a solution that solves the problem now is more valuable than following the perfect best practice, especially when you need results quickly.
Congrats on the series!
03-12-2026 06:12 AM
Exactly @wesleyfelipe! Solutions should be good enough and improve organically as needed!
a month ago
Good series. The query-scoped vs driver-scoped framing from Part 1 is something I haven't seen written down clearly before, even though everyone who's run streaming in prod has hit it.
One thing that kept nagging me reading all three parts: Serverless Jobs never comes up. That's the obvious answer to "cost is why we haven't switched." Per-task isolation, no cluster lifecycle to manage, no cold start tax. What was the reason it was off the table?
Also the ConcurrentAppendException mention at the end of Part 3 is the thing I most want to read about. That's not a retry problem, that's Delta isolation levels and isBlindAppend semantics inside foreachBatch. Different beast entirely.
a month ago
First of all, thanks a lot for reading and for the encouraging write-up!
I really enjoy sharing my production journey in the form of blog posts or articles, because this is something that is hard to find in books or docs!
If you are curious about this kind of production guide, I have started writing an e-book here: https://kirankbs.com/ebook/ Feel free to check it out!
Thanks!
a month ago
Completely agree, production war stories are worth more than any documentation. I've eaten enough teeth on production data lake issues to write my own chapter on what can go wrong, whether that's deploying Databricks in financial institutions or being one of the first to do a full SIEM replacement on production. Some of those scars ended up on my blog: https://dere.la