Hi all - Matt Jones here. I'm on the Data Streaming team at Databricks, and I wanted to share a few takeaways from last week's Current 2022 data streaming event (formerly Kafka Summit) in Austin.
By far the most common question we got at the booth was how/why customers would use Kafka/Confluent and Databricks together. A popular use case is to aggregate streaming events through a Kafka-based collector system, then send that event stream into a Databricks streaming pipeline (or roll your own with Spark Structured Streaming, if you prefer). Frank Munz’s blog post on this topic is an excellent overview.
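For those curious what that pattern looks like in practice, here's a minimal PySpark sketch of the Kafka-to-Databricks flow: read from a Kafka topic with Structured Streaming and write the stream out to a Delta table. The broker address, topic name, and paths are placeholders you'd replace with your own; this is a skeleton of the pattern, not a production pipeline (no schema parsing, auth, or error handling).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Subscribe to a Kafka topic as a streaming source.
# "broker:9092" and "events" are placeholder values.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers keys/values as binary; cast the payload to a string
# (you'd typically apply from_json with a schema here).
parsed = events.selectExpr("CAST(value AS STRING) AS raw_json")

# Sink the stream to a Delta table, with a checkpoint location so the
# query can recover exactly-once state after a restart.
query = (
    parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start("/tmp/tables/events")
)
```

On Databricks you'd more likely express this as a Delta Live Tables pipeline, but the underlying source/sink mechanics are the same.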
In addition to the sessions we had at the event, our head of streaming Karthik Ramasamy hosted a meetup that delved into the details of Project Lightspeed, our next-gen Structured Streaming work. The meetup format is a great way to get into more conversational depth than a breakout session affords - for example, one of Karthik's former students from UC Berkeley dug into the details of how we handle asynchronous state checkpointing for low-latency pipelines.
I also had some productive conversations about what Databricks users want from streaming - low latency is obviously desirable, but it has to be balanced against cost and accuracy (given windowing considerations, late-arriving data, etc.). Then of course there are scale/throughput considerations. I'd love to hear how your organizations/teams approach this tradeoff.
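To make the latency/accuracy tradeoff concrete, here's a toy Python sketch of the watermark idea - not Spark's actual implementation, just an illustration. A shorter watermark delay lets you finalize windows sooner (lower latency, less state), but any event arriving later than the watermark gets dropped, costing accuracy. All names here are my own for illustration; in Structured Streaming the equivalent knob is `withWatermark()`.

```python
from collections import defaultdict


def windowed_counts(events, window_size, watermark_delay):
    """Tumbling-window event counts with a simple watermark.

    events: iterable of (event_time, key) pairs in arrival order.
    Events older than (max event time seen) - watermark_delay are
    dropped as "too late" - trading accuracy for bounded state.
    """
    counts = defaultdict(int)   # window start -> event count
    max_seen = float("-inf")    # high-water mark of event time
    dropped = 0

    for event_time, _key in events:
        max_seen = max(max_seen, event_time)
        if event_time < max_seen - watermark_delay:
            dropped += 1        # arrived past the watermark; discard
            continue
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1

    return dict(counts), dropped


# Two events arrive late relative to the stream's progress; with a
# watermark delay of 5 time units, both are dropped.
counts, dropped = windowed_counts(
    [(1, "a"), (12, "b"), (3, "c"), (25, "d"), (2, "e")],
    window_size=10,
    watermark_delay=5,
)
# counts == {0: 1, 10: 1, 20: 1}; dropped == 2
```

Widen `watermark_delay` and those late events get counted - at the cost of holding window state open longer. That's the dial teams are really arguing about when they argue about "latency."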
The ubiquity of streaming use cases was my big takeaway from Current 2022. Performant streaming architecture isn't a cutting-edge set of use cases reserved for high tech; it's becoming a democratized practice for everyone from grocery stores to the public sector.
If you were at Current, what was the most impactful/interesting thing you got from the event? If you weren’t able to join us this year, please do add your voice - what’s on your data streaming wish list for the next year?