Community Articles

Streaming Failure Models: Why "It Didn't Crash" Is the Worst Outcome

Kirankumarbs
Contributor

Most Databricks streaming failures don't look dramatic.

No cluster termination. No red wall of errors. The UI says RUNNING — and your customers start reporting nonsense.

I wrote about the incident that changed how we think about streaming jobs on shared clusters:

- Why query-scoped failures are more dangerous than driver-scoped ones
- How query.awaitTermination() on each stream individually caused us to miss a silent failure for 12 minutes
- Why continuous jobs don't save you if the JVM never fails in the first place
- The one-line fix (awaitAnyTermination) that stopped the lying — and why it's still a band-aid
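The failure model behind the second and fourth bullets can be sketched without a cluster. The snippet below is a minimal simulation, not Spark code: `FakeQuery` and `await_any` are hypothetical stand-ins for `StreamingQuery.awaitTermination()` and `spark.streams.awaitAnyTermination()`, showing why blocking on one healthy stream hides a silent failure in another.

```python
import threading
import time

class FakeQuery:
    """Illustrative stand-in for a Spark StreamingQuery (not the real API)."""
    def __init__(self, name, fails_after=None):
        self.name = name
        self._done = threading.Event()
        self.exception = None
        if fails_after is not None:
            # Simulate a query-scoped failure: the query dies quietly
            # while the driver JVM keeps running.
            t = threading.Timer(fails_after, self._fail)
            t.daemon = True
            t.start()

    def _fail(self):
        self.exception = RuntimeError(f"{self.name} failed silently")
        self._done.set()

    def await_termination(self, timeout=None):
        # Blocks until THIS query terminates, like query.awaitTermination().
        return self._done.wait(timeout)

def await_any(queries, timeout):
    # Behaves like spark.streams.awaitAnyTermination():
    # returns as soon as ANY query stops, healthy or not.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for q in queries:
            if q._done.is_set():
                return q
        time.sleep(0.01)
    return None

healthy = FakeQuery("orders")                       # never terminates
broken = FakeQuery("payments", fails_after=0.1)     # dies quietly

# Anti-pattern: awaiting the healthy query first blocks until the
# timeout; the payments failure goes completely unnoticed.
noticed = healthy.await_termination(timeout=0.5)
print("anti-pattern noticed a failure:", noticed)   # False

# The one-line-fix pattern: await ANY termination, so the failed
# query surfaces immediately instead of being masked.
failed = await_any([healthy, broken], timeout=0.5)
print("awaitAnyTermination surfaced:", failed.name, failed.exception)
```

In real Spark code the fix is the same shape: replace the per-query `awaitTermination()` loop with a single `spark.streams.awaitAnyTermination()`, then inspect which query stopped and why.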

Full post available on Medium or the blog.

Part 2 (multi-task on a shared cluster — why that's also not enough) coming soon.

I am always happy to share and learn about production insights!

 

9 REPLIES

wesleyfelipe
Contributor

@Kirankumarbs  This is really great content!

Streaming monitoring has always been challenging.
I'm planning to write about a similar situation I faced a few years ago.

I'm looking forward to part 2 of your series.

Kirankumarbs
Contributor

Thanks! I am glad that you liked it!

Indeed, streaming constructs, unit/integration testing, and monitoring are much more involved than simple batch jobs!

I am already writing the second part and am excited to share it, probably on March 5th!

Kirankumarbs
Contributor

Part 2 is complete as well! Multi-Task on a Shared Cluster — Why That's Also Not Enough

An interesting read!

Thanks for reading and Happy to Learn/Share!

Kirankumarbs
Contributor

There we go, Part 3 is also available!

Thanks for the encouragement, and I'm glad to write and share such production insights!

wesleyfelipe
Contributor

@kiran

I really enjoy reading these kinds of real-world problem cases. I like how practical and grounded your articles are. Sometimes having a solution that solves the problem now is more valuable than following the perfect best practice, especially when you need results quickly.

Congrats on the series!

Kirankumarbs
Contributor

Exactly, @wesleyfelipe! Solutions should be good enough now and improve organically as needed!

mderela
Contributor

Good series. The query-scoped vs driver-scoped framing from Part 1 is something I haven’t seen written down clearly before, even though everyone who’s run streaming in prod has hit it.
One thing that kept nagging me reading all three parts: Serverless Jobs never comes up. That’s the obvious answer to “cost is why we haven’t switched.” Per-task isolation, no cluster lifecycle to manage, no cold start tax. What was the reason it was off the table?
Also the ConcurrentAppendException mention at the end of Part 3 is the thing I most want to read about. That’s not a retry problem, that’s Delta isolation levels and isBlindAppend semantics inside foreachBatch. Different beast entirely.

Kirankumarbs
Contributor

First of all, thanks a lot for reading and for the encouraging words!

I really enjoy sharing the production journey I'm going through in the form of blogs and articles, because this kind of material is hard to find in books or docs!

If you're curious about this kind of production guide, I've started writing an e-book here: https://kirankbs.com/ebook/ feel free to check it out!

Thanks!

mderela
Contributor

Completely agree, production war stories are worth more than any documentation. I've cut my teeth on enough production data lake issues to write my own chapter on what can go wrong, whether that's deploying Databricks in financial institutions or being one of the first to do a full SIEM replacement in production. Some of those scars ended up on my blog: https://dere.la