Between 2019 and 2021, we built a multi-market payments data platform on Databricks that now processes more than 7 billion transactions per year across seven markets.
Ingestion was by far the most operationally complex layer.
To support MongoDB CDC streams, we engineered:
- A custom Python CDC publisher
- Azure Event Hubs as the message backbone
- Avro landing in the raw layer
- A generic Spark Structured Streaming framework to load Bronze (Delta)
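For context, the core job of that custom publisher was small but critical: flatten each MongoDB change-stream event into a stable envelope before publishing it to Event Hubs. A minimal sketch of that step — field names and envelope shape are illustrative, not our exact schema, and in production the payload was Avro-encoded rather than JSON:

```python
import json
from datetime import datetime, timezone


def flatten_change_event(event: dict) -> bytes:
    """Map a MongoDB change-stream event to a compact CDC envelope.

    Illustrative only: real change-stream events carry BSON types and
    a resume token, and the production encoding was Avro, not JSON.
    """
    record = {
        "op": event["operationType"],          # insert / update / delete / replace
        "key": event["documentKey"]["_id"],    # primary key of the source document
        "payload": event.get("fullDocument"),  # absent for deletes
        "source_ts": event["clusterTime"],     # ordering hint from the oplog
        "published_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, default=str).encode("utf-8")
```

A stable envelope like this is what lets the downstream Bronze loader stay generic: every dataset arrives with the same operation, key, and timestamp fields regardless of its source collection's schema.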
The architecture worked and scaled, but it required significant custom engineering, careful orchestration, and continuous operational attention as data volume and dataset count grew.
Looking at the capabilities available today, especially Zerobus, it’s hard not to see how much simpler this ingestion layer could become. While still in preview, Zerobus represents a shift toward reducing message-bus dependency, custom streaming frameworks, and ingestion-specific infrastructure.
If it matures as expected, it has strong potential to become the default solution for near-real-time ingestion on Databricks.
I wrote a detailed breakdown of the original architecture, the scaling challenges we encountered, and why Zerobus may fundamentally change how ingestion is designed going forward.
🔗 [Medium] Databricks Lakehouse Without the Workarounds — Part 1: Ingestion