
Streaming Failure Models: Why "It Didn't Crash" Is the Worst Outcome

Kirankumarbs
Contributor

Most Databricks streaming failures don't look dramatic.

No cluster termination. No red wall of errors. The UI says RUNNING, and your customers start reporting nonsense.

I wrote about the incident that changed how we think about streaming jobs on shared clusters:

- Why query-scoped failures are more dangerous than driver-scoped ones
- How query.awaitTermination() on each stream individually caused us to miss a silent failure for 12 minutes
- Why continuous jobs don't save you if the JVM never fails in the first place
- The one-line fix (awaitAnyTermination) that stopped the lying, and why it's still a band-aid
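To make the awaitTermination vs. awaitAnyTermination point concrete, here is a minimal sketch. It is a stdlib-only simulation, not Spark code: `FakeQuery`, `await_any_termination`, and `run_demo` are hypothetical names made up for illustration, and the timings are arbitrary. In PySpark the corresponding real calls would be `query.awaitTermination()` on a single query and `spark.streams.awaitAnyTermination()` on the session's query manager.

```python
import threading
import time


class FakeQuery:
    """Hypothetical stand-in for a streaming query (not a Spark class)."""

    def __init__(self, name):
        self.name = name
        self._done = threading.Event()
        self.error = None

    def fail(self, message):
        # Query-scoped failure: this query dies, the process keeps running.
        self.error = RuntimeError(message)
        self._done.set()

    def await_termination(self, timeout=None):
        # Blocks until this one query ends; re-raises its error if it failed.
        finished = self._done.wait(timeout)
        if finished and self.error is not None:
            raise self.error
        return finished


def await_any_termination(queries, timeout, poll=0.01):
    # Assumed equivalent of "wait for ANY query to end": polls every query,
    # so the first failure surfaces no matter which query it belongs to.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for q in queries:
            if q.await_termination(timeout=0):
                return q
        time.sleep(poll)
    return None


def run_demo():
    orders, payments = FakeQuery("orders"), FakeQuery("payments")
    # The second query fails shortly after start; the first stays healthy.
    threading.Timer(0.05, payments.fail, args=("sink unavailable",)).start()

    # Pattern 1: wait on each query in turn. The wait on the healthy first
    # query never returns, so the second query's failure is never observed.
    # (A timeout bounds the wait here; in production it would be unbounded.)
    sequential_saw_failure = False
    try:
        for q in (orders, payments):
            if not q.await_termination(timeout=0.3):
                break  # still stuck on a healthy query
    except RuntimeError:
        sequential_saw_failure = True

    # Pattern 2: wait on any query. The failed query's error is raised.
    any_saw_failure = None
    try:
        await_any_termination([orders, payments], timeout=0.3)
    except RuntimeError as exc:
        any_saw_failure = str(exc)

    return sequential_saw_failure, any_saw_failure


if __name__ == "__main__":
    print(run_demo())
```

The sequential pattern only notices a failure in whichever query it happens to be blocked on, which is how a job can keep reporting RUNNING while one of its streams is dead. Waiting on any termination turns the first query-scoped failure into an exception in the driver code.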

Full Post at Medium or Blog

Part 2 (multi-task on a shared cluster: why that's also not enough) coming soon.

I am always happy to share and learn about production insights!

 

9 REPLIES

wesleyfelipe
Contributor

@Kirankumarbs  This is really great content!

Streaming monitoring has always been challenging. 
I'm planning on writing on a similar situation I faced a few years ago too.

I'm looking forward to Part 2 of your series.

Thanks! I am glad that you liked it!

Indeed, streaming constructs, unit/integration testing, and monitoring are much more involved and complicated than simple batch jobs!

I am already writing Part 2 and am excited to share it, probably on March 5th!

Kirankumarbs
Contributor

I completed Part 2 as well! Multi-Task on a Shared Cluster: Why That's Also Not Enough

An interesting read!

Thanks for reading. Happy to learn and share!

Kirankumarbs
Contributor

There we go, Part 3 is also available!

Thanks for the encouragement, and I'm glad to write and share such production insights!

wesleyfelipe
Contributor

@kiran

I really enjoy reading these kinds of real-world problem cases. I like how practical and grounded your articles are. Sometimes having a solution that solves the problem now is more valuable than following the perfect best practice, especially when you need results quickly.

Congrats on the series!

Exactly @wesleyfelipe! Solutions should be good enough and improve organically as needed!

mderela
Contributor

Good series. The query-scoped vs driver-scoped framing from Part 1 is something I haven't seen written down clearly before, even though everyone who's run streaming in prod has hit it.
One thing that kept nagging me reading all three parts: Serverless Jobs never comes up. That's the obvious answer to "cost is why we haven't switched." Per-task isolation, no cluster lifecycle to manage, no cold start tax. What was the reason it was off the table?
Also the ConcurrentAppendException mention at the end of Part 3 is the thing I most want to read about. That's not a retry problem, that's Delta isolation levels and isBlindAppend semantics inside foreachBatch. Different beast entirely.

First of all, thanks a lot for reading and for the encouraging write-up!

I really enjoy sharing the production journey I am going through in the form of blog posts and articles, because this is something that is hard to find in books or docs!

If you are curious about this kind of production guide, I have started writing an e-book here: https://kirankbs.com/ebook/. Feel free to check it out!

Thanks!

mderela
Contributor

Completely agree, production war stories are worth more than any documentation. I've eaten enough teeth on production data lake issues to write my own chapter on what can go wrong, whether that's deploying Databricks in financial institutions or being one of the first to do a full SIEM replacement on production. Some of those scars ended up on my blog: https://dere.la