At this past year’s Data and AI Summit we launched an event-focused chatbot called Brickbot. When we conceived the project, we had high ambitions of an intelligent, omniscient multimodal agent that could create personalized schedules and handle real-time event data like unexpected announcements and live presentations. Unfortunately, we were short on both time and resources, so we shipped a more scoped-down proof of concept to get the ball rolling. Despite all that, Brickbot was still a success in its early form (you can read the debrief here), so we’re excited to have the opportunity this coming year to add some of our wishlist features in time for the 2025 Data and AI Summit.
One of the features we’ve started to prototype is live video and audio transcription. Adding real-time sources like this to Brickbot is not only plain cool, letting attendees apply the latest models directly to live content, but also, we hope, a base from which other interesting features can be derived in the future. Here’s a rough snippet of what we imagine the experience could be:
As always, there are aspects of our imagined end state that already exist as ready-made solutions, but we’re challenging ourselves not only to build it ourselves, but to build it as much as possible with Databricks. Here are some of the things we’re thinking about as we prototype real-time audio transcription with Databricks.
Chat, code, AND do the work!?
It’s a pure joy to work on something that can exploit the full breadth of LLM capabilities.
Chatting? Since we’re not audio and speech-to-text experts, it’s extremely helpful to just ask the LLM: “how would you transcribe audio in real time?” or “I think we should just use the OpenAI Whisper model, but what am I missing?”. Conversing in plain language lets us ask the stupid questions safely.
Coding? Still imperfect, unfortunately, but having a computer get you 80% of the way there is nothing short of magic.
Just doing the work? If giving you the code isn’t magic enough, simply doing the task must surely be? Our current approach to hitting the tightest latency SLA is to continuously transcribe a sliding window of audio - that’s what we need for that real-time caption experience. Unfortunately, that presents a new challenge: how to stitch together transcription segments like these:
Transcribing segment 15...
Took: 739
[" implemented in among other projects and why it's such an interesting language to look at and use.", ' So yeah, this is the agenda for today. Some introductions, why...']
Transcribing segment 16...
Took: 776
[" and why it's such an interesting language to look at and use.", ' So yeah, this is the agenda for today.', ' Some introductions, why Ska language is interesting.']
Transcribing segment 17...
Took: 811
[' to look at and use.', ' So yeah, this is the agenda for today.', ' Some introductions, why the SCAR language is interesting,', ' some comparisons versus other languages.']
Transcribing segment 18...
Took: 845
[' agenda for today. Some introductions, why SCAR language is interesting, some comparisons', ' versus other languages and the conclusion.']
Transcribing segment 19...
Took: 726
[' some introductions, why the Scala language is interesting, some comparisons', " versus other languages and the conclusion. So I've been using Scala since 2012"]
If you look closely you’ll see a number of problems: Scala is sometimes transcribed as SCAR and sometimes as the Ska language, and transcription generally gets messy at the segment boundaries. Having very little audio is great for fast processing, but terrible for accuracy.
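For reference, here’s a rough sketch of the kind of sliding-window loop that produces output like the above. It’s a minimal illustration assuming the open-source whisper package and a pre-recorded talk.wav standing in for the live stream; the window and step sizes are placeholders, not our final settings.

import time
import whisper  # open-source openai-whisper package

WINDOW_SECONDS = 10   # audio per transcription call
STEP_SECONDS = 5      # how far the window slides each iteration
SAMPLE_RATE = 16000   # whisper expects 16 kHz mono audio

model = whisper.load_model("base")

# Stand-in for the live stream: load a whole talk and slide over it.
audio = whisper.load_audio("talk.wav", sr=SAMPLE_RATE)

segment_id = 0
for start in range(0, len(audio) - WINDOW_SECONDS * SAMPLE_RATE,
                   STEP_SECONDS * SAMPLE_RATE):
    window = audio[start:start + WINDOW_SECONDS * SAMPLE_RATE]

    t0 = time.time()
    result = model.transcribe(window, fp16=False)
    took_ms = int((time.time() - t0) * 1000)  # printed the same way as the log above, assuming milliseconds

    print(f"Transcribing segment {segment_id}...")
    print(f"Took: {took_ms}")
    print([seg["text"] for seg in result["segments"]])
    segment_id += 1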
Now, you can ask the LLM for state-of-the-art approaches to reconciling transcription segments, but only until it becomes clear you shouldn’t bother. We just ask it to do the reconciliation, and find it does it very well.
It’s plausible we could even stuff the messy transcription directly into the context of user questions and skip the reconciliation altogether. We’ll see how things go when we start testing the experience.
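To make the “just ask it to do it” approach concrete, here’s a hedged sketch that hands overlapping segments to a chat model behind an OpenAI-compatible endpoint. The endpoint URL, token, and model name are placeholders for whatever serving endpoint we end up using, and the prompt is illustrative, not our production prompt.

from openai import OpenAI

# Placeholder base URL and token; any OpenAI-compatible chat endpoint
# (for example, a Databricks model serving endpoint) would work here.
client = OpenAI(base_url="https://<workspace-host>/serving-endpoints",
                api_key="<token>")

def reconcile(transcript_so_far: str, new_segments: list[str]) -> str:
    """Ask the LLM to merge an overlapping segment into the running transcript."""
    prompt = (
        "You are merging overlapping speech-to-text segments from a live talk.\n"
        f"Running transcript so far:\n{transcript_so_far}\n\n"
        f"Newest overlapping segment:\n{' '.join(new_segments)}\n\n"
        "Return only the updated transcript, deduplicating the overlap and "
        "fixing obvious mis-transcriptions (for example, 'SCAR' should be 'Scala')."
    )
    response = client.chat.completions.create(
        model="<chat-model-endpoint-name>",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

transcript = " and why it's such an interesting language to look at and use."
transcript = reconcile(transcript, [
    " to look at and use.",
    " So yeah, this is the agenda for today.",
    " Some introductions, why the SCAR language is interesting,",
])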
Compound AI
Although it wasn’t our intention, it quickly dawned on us that we’re building what is now being referred to as a Compound AI system. Compound AI systems, as defined by the Berkeley AI Research (BAIR) blog, are systems that tackle AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers, or external tools. In the case of our live transcription, we have both a speech-to-text model and an LLM working together to provide real-time captions and context for question suggestion and answering.
Accuracy, Evals
We plan to play with audio segment sizes, like 10 vs 7 vs 5 second sliding windows, but we can already see anecdotally that the sweet spot is likely around 10 seconds given our SLA requirements. Anything less and the quality drops off significantly; anything more and the lag is too long.
With that said, we have various use cases for the transcripts that can tolerate different error rates. For injection into the context we can tolerate a high error rate, since it’s not user-facing, and we can also provide dependable static content in the context, like the talk title, description, and category, to counteract messy transcripts. This all helps overcome poor transcriptions like Scala as SCAR. For on-screen captions, though, we’ll need something higher quality, at the cost of being a little behind.
Evaluation for this kind of use case is luckily straightforward. We have the benefit of years’ worth of talks that we can batch transcribe ourselves (and get highly accurate results for) using the same models, and we can also use external resources like the YouTube captions that already exist. Comparing our real-time results to these sources is as good as we can ask for.
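As a sketch of what that comparison could look like, a simple word error rate between the stitched real-time transcript and a reference transcript is enough to get started. The jiwer package here is just one convenient option, not necessarily what we’ll end up using.

import jiwer

# Reference: a careful batch transcription of the talk, or its YouTube captions.
reference = "some introductions why the scala language is interesting"
# Hypothesis: the stitched real-time transcript, lower-cased and stripped of
# punctuation so formatting differences don't dominate the score.
hypothesis = "some introductions why the scar language is interesting"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")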
Deploying Directly on Databricks
We’ll certainly use model serving to deploy our models. One outstanding question is how much of the pipeline to actually put into serving. For example, if we choose to reconcile overlapping transcriptions, we could do so entirely outside the model endpoint, or we could buffer previous audio frames in the endpoint itself so that we only send the most recent bytes. We could also make external LLM calls from that model endpoint so it returns only clean transcripts. These are some of the decisions we’ll have to make as we progress.
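As a very rough sketch of the “buffer in the endpoint” option, here’s what a stateful MLflow pyfunc model could look like. The input schema, window size, and in-memory buffering are illustrative assumptions only - a real deployment would also have to deal with state across endpoint replicas.

import numpy as np
import mlflow.pyfunc
import whisper

WINDOW_SECONDS = 10   # illustrative window length
SAMPLE_RATE = 16000

class BufferedTranscriber(mlflow.pyfunc.PythonModel):
    """Keeps a rolling audio buffer so callers only send the newest frames."""

    def load_context(self, context):
        self.model = whisper.load_model("base")
        self.buffer = np.zeros(0, dtype=np.float32)

    def predict(self, context, model_input):
        # Assumed schema: an "audio" column whose single row is a list of
        # float PCM samples for the newest frames.
        new_samples = np.asarray(model_input["audio"].iloc[0], dtype=np.float32)
        self.buffer = np.concatenate([self.buffer, new_samples])
        # Keep only the most recent WINDOW_SECONDS of audio.
        self.buffer = self.buffer[-WINDOW_SECONDS * SAMPLE_RATE:]
        result = self.model.transcribe(self.buffer, fp16=False)
        return [seg["text"] for seg in result["segments"]]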
Prototyping with Clusters and Notebooks
Our serverless offering is growing more and more powerful with every release. Still, there are times when having access to the underlying compute is not only useful, but necessary. For example, we enjoy prototyping in as close to a production state as possible, so streaming actual RTMP streams with ffmpeg (instead of mocking the data) for testing directly in Databricks requires us to drop down to the shell with the magic %sh command. Maybe one day that too will be serverless…!
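For example, something along these lines pulls a stream and chops it into 16 kHz mono WAV chunks that a transcription loop can pick up as they land. The RTMP URL and output path are placeholders, the ffmpeg flags are one reasonable set rather than the only option, and the same command works fine in a %sh cell.

import os
import subprocess

RTMP_URL = "rtmp://<stream-host>/live/<stream-key>"  # placeholder
OUT_DIR = "/tmp/live_audio"
os.makedirs(OUT_DIR, exist_ok=True)

subprocess.run([
    "ffmpeg",
    "-i", RTMP_URL,
    "-vn",                 # drop the video track
    "-ac", "1",            # mono
    "-ar", "16000",        # 16 kHz, what whisper expects
    "-f", "segment",       # split output into fixed-length files
    "-segment_time", "5",
    f"{OUT_DIR}/chunk_%05d.wav",
], check=True)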
Closely related to having access to the underlying compute is having the ability to reliably share that compute across multiple notebooks. Why would we need multiple notebooks? Well, sometimes you have blocking cells, especially when you want to simulate a number of things on a single machine. Having multiple independent notebooks that can “talk to each other” turns out to be an amazing feature when you’re hacking away at an idea.
That’s it for now. Excited to share more as we get closer to DAIS2025.