Introduction

TL;DR

Zerobus Ingest is a serverless, Kafka-free ingestion service in Databricks that allows applications and IoT devices to stream data directly into Delta Lake with low latency and minimal operational overhead.

Real-time data ingestion is a core requirement for modern IoT and event-driven architectures. Traditionally, platforms like Apache Kafka have been used as an intermediary layer between producers and analytics systems, adding operational complexity and latency.

Zerobus Ingest is Databricks’ Kafka-free ingestion solution that allows applications and devices to write events directly into Delta tables with low latency and minimal infrastructure. In this article, we explore how Zerobus works, when to use it, and how to ingest real-time events step by step.

With the Zerobus connector, producers simply push data using a lightweight API, and Zerobus takes care of buffering, reliability, and scaling behind the scenes.

The Challenge with Traditional Streaming Architectures

Traditional real-time data pipelines often rely on messaging systems like Kafka to move data from applications. While effective, this approach introduces several challenges, such as:

  • Architecture Complexity: The introduction of a message bus adds extra layers to the data pipeline, making the overall architecture more complex and harder to maintain.
  • Operational Overhead: Managing and operating streaming infrastructure requires continuous effort across scaling, monitoring, security, and fault handling.
  • Increased Latency: Multiple handoffs between systems introduce delays, limiting the ability to deliver near real-time insights.

How Zerobus Ingest Enables Real-Time Ingestion in Databricks

Zerobus addresses these challenges and simplifies the ingestion process:

  • Simplified architecture: Eliminates the need for a message broker by ingesting events directly into Delta Lake.
  • Lower operational overhead: Fully managed and serverless ingestion removes the burden of running and scaling streaming infrastructure.
  • Faster time to insights: Low-latency ingestion makes event data immediately available for analytics and reporting in Databricks.

In practice, Zerobus Ingest is designed for event-driven and IoT ingestion use cases where Databricks is the primary analytics platform.

Why Zerobus Matters for Modern Data Architectures

  1. Eliminates the message bus layer (Kafka/Kinesis), reducing pipeline hops and enabling faster real-time data ingestion on Databricks.
  2. Provides near real-time streaming ingestion with low operational overhead, making it simpler and more cost-efficient for event-driven architectures.

Modern data architectures increasingly favor simpler, event-first designs that minimize moving parts while preserving reliability and scale. Zerobus Ingest on Databricks supports this shift by reducing ingestion pipelines to their essential components and removing unnecessary infrastructure between event producers and Delta Lake storage. At the same time, the native integration with Unity Catalog ensures governance, security, and lineage are applied from the moment data is written.

In short, Zerobus is a strong fit when Databricks is the primary destination and the goal is Kafka-free streaming ingestion into Delta Lake.

Key Components and Architecture

Zerobus Ingest is implemented as a serverless ingestion layer within Databricks, exposed through gRPC and REST interfaces. Producers establish a streaming connection to the Zerobus endpoint and send events, commonly serialised using Protocol Buffers, directly to the target Delta table.

Behind the scenes, Zerobus integrates natively with Delta Lake for durable, transactional storage and Unity Catalog for access control and governance.

 Client SDKs in languages such as Python, Java, and Rust abstract the complexity of stream creation and record ingestion, allowing developers to implement scalable real-time pipelines with minimal configuration.

[Figure: Zerobus Ingest architecture]

  • gRPC: A high-performance communication protocol used by Zerobus to stream events from producers to Databricks with low latency and reliability.
  • Protobuf: A compact, strongly typed data format that defines the event schema and efficiently serialises data before ingestion (a sketch of such a schema follows this list).
  • Client: Runs in the application or device, packages events using Protobuf, and sends them to Zerobus using the gRPC API.
  • Zerobus Server: A fully managed, serverless Databricks service that receives events, handles buffering and durability, and writes data directly into Delta tables.
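
To make the data contract concrete, the sketch below shows what a .proto definition might look like for a simple sensor table. The message name and fields are hypothetical; in practice the definition is generated from the target Delta table's schema, as described later in this article.

// record.proto — hypothetical schema for a table with
// device_id, temperature, and event_time columns
syntax = "proto2";

message SensorReading {
  optional string device_id   = 1;
  optional double temperature = 2;
  optional int64  event_time  = 3;
}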

 

Zerobus vs Kafka vs Auto Loader: Databricks Ingestion Comparison

Zerobus Ingest complements, rather than replaces, existing Databricks ingestion tools. Each ingestion option is designed for a different data source type, latency requirement, and operational model. Choosing the right ingestion pattern depends on whether data is produced as events, files, or database changes, as well as the level of real-time processing and infrastructure complexity required.

The following comparison summarises how Zerobus Ingest stacks up against Kafka, Auto Loader, and CDC pipelines in Databricks, with practical guidance on when each option is the best fit for real-time data ingestion workloads.

 

Zerobus Ingest
  • Best for: Direct real-time event ingestion
  • Data source type: Applications, IoT devices, services
  • Latency: Low (near real-time)
  • Operational effort: Very low (serverless)
  • When to choose: When events need to land directly in Delta tables with minimal infrastructure and Databricks is the primary destination

Auto Loader
  • Best for: Incremental file ingestion
  • Data source type: Files in cloud storage
  • Latency: Medium (micro-batch)
  • Operational effort: Low
  • When to choose: When data arrives as files or batches and near-real-time processing is not required

Kafka + Structured Streaming
  • Best for: Large-scale event streaming and complex processing
  • Data source type: Event streams via a message broker
  • Latency: Low
  • Operational effort: High
  • When to choose: When multiple consumers, message retention, or advanced stream processing is required

CDC Pipelines
  • Best for: Database change data capture (CDC)
  • Data source type: Transactional databases
  • Latency: Medium to low
  • Operational effort: Medium
  • When to choose: When replicating database changes into the Lakehouse while maintaining row-level consistency

Ingestion Method Decision Tree

[Figure: ingestion method decision tree]

Ingesting Real-Time Messages Using Zerobus Ingest

Prerequisites & Environment Setup

Server (Workspace side)

  1. Zerobus access: At the time of writing, Zerobus Ingest is in public preview; if it is not enabled in your workspace, contact your Databricks account executive to have it enabled.
  2. Workspace URL, e.g. https://adb-XXXXXXXXXXXX.XX.azuredatabricks.net/
  3. Zerobus server endpoint, e.g. XXXXXXXXXXXX.zerobus.<REGION_NAME>.azuredatabricks.net
  4. Target table name (<UC.SCHEMA.TABLE>) where the data needs to be written
  5. Service principal details (CLIENT_ID and CLIENT_SECRET), found under workspace Settings > Identity and Access

Client side

  1. Preferred SDK (Python, Java, or Rust)
  2. Protobuf schema of the table, if the Protobuf protocol is being used

Generating Protobuf (skip this if you are using JSON)

In Zerobus:

  • The .proto file defines the schema of the events written to Delta tables.
  • Compiled client code (.py, .java, etc.) is used by producers to send data.
  • This guarantees schema correctness, performance, and compatibility at ingestion time.

Protobuf defines the data contract, and the compiled files provide the language-specific code needed to efficiently send and receive that data.

Get the proto definition:

python -m zerobus.tools.generate_proto \
    --uc-endpoint "$UC_ENDPOINT" \
    --client-id "$CLIENT_ID" \
    --client-secret "$CLIENT_SECRET" \
    --table "$TABLE_NAME" \
    --output "$OUTPUT_FILE"

This generates a .proto file containing the Protocol Buffer definition. In the next step, we compile it into a language-specific module:

python -m grpc_tools.protoc --python_out=. --proto_path=. record.proto

This generates the compiled Python proto module (e.g. record_pb2.py), which the Python SDK uses to serialise messages.
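
As a quick illustration, a producer can import the compiled module and build a typed record. The message and field names below are hypothetical and must match the schema generated from your target table.

import record_pb2  # module generated by grpc_tools.protoc from record.proto

# Build a typed event; field names are hypothetical and must match
# the generated schema.
record = record_pb2.SensorReading(
    device_id="sensor-042",
    temperature=21.7,
    event_time=1766502911,
)
payload = record.SerializeToString()  # compact binary encoding sent over gRPC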

Sending Messages Using the SDK

Load config (give the client information about the Zerobus server):

[Screenshot: defining the configuration for the Zerobus server]
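
A minimal sketch of this step in Python follows. The SDK import paths and class names (ZerobusSdk, TableProperties) are assumptions based on the preview SDK and may differ in your version.

import os
import record_pb2  # compiled proto from the previous step

# Assumed SDK imports; exact module paths may differ across SDK versions.
from zerobus.sdk.sync import ZerobusSdk
from zerobus.sdk.shared import TableProperties

# Connection settings from the prerequisites section.
SERVER_ENDPOINT = os.environ["ZEROBUS_ENDPOINT"]  # XXXX.zerobus.<REGION_NAME>.azuredatabricks.net
WORKSPACE_URL = os.environ["WORKSPACE_URL"]       # https://adb-XXXX.XX.azuredatabricks.net
CLIENT_ID = os.environ["CLIENT_ID"]               # service principal client ID
CLIENT_SECRET = os.environ["CLIENT_SECRET"]       # service principal secret
TABLE_NAME = "main.iot.sensor_readings"           # hypothetical <UC.SCHEMA.TABLE>

# Bind the target table to the compiled Protobuf descriptor.
table_properties = TableProperties(TABLE_NAME, record_pb2.SensorReading.DESCRIPTOR)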

Create stream (open a stream with the Zerobus server using the config):

[Screenshot: opening the Zerobus stream]
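
Continuing the sketch, the client authenticates with the service principal credentials and opens a stream bound to the target table (create_stream and its parameters are again assumptions from the preview SDK).

# Initialise the SDK against the Zerobus endpoint and workspace,
# then open a stream bound to the target table.
sdk = ZerobusSdk(SERVER_ENDPOINT, WORKSPACE_URL)
stream = sdk.create_stream(CLIENT_ID, CLIENT_SECRET, table_properties)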

Send records to the server (async or sync):

[Screenshot: sending messages to Zerobus]
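
In the synchronous flavour, each call blocks until the record is acknowledged, while the async SDK exposes the same flow as awaitable calls; the ingest_record method name is an assumption from the preview SDK.

# Send a batch of typed records; each call blocks until the record
# is acknowledged (the async SDK awaits instead).
for i in range(10):
    record = record_pb2.SensorReading(
        device_id=f"sensor-{i:03d}",
        temperature=20.0 + i,
        event_time=1766502911 + i,
    )
    stream.ingest_record(record)

stream.close()  # flush any buffered records and release the stream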

The final wrapper (calling it all together):

[Screenshot: main runner function]
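
Putting it all together, a minimal end-to-end runner might look like the following sketch, under the same SDK assumptions as above; generate_readings is a hypothetical event source.

def main():
    # 1. Configure: bind the target table to its Protobuf descriptor.
    table_properties = TableProperties(TABLE_NAME, record_pb2.SensorReading.DESCRIPTOR)

    # 2. Open a stream to the Zerobus server.
    sdk = ZerobusSdk(SERVER_ENDPOINT, WORKSPACE_URL)
    stream = sdk.create_stream(CLIENT_ID, CLIENT_SECRET, table_properties)

    # 3. Send records, closing the stream to flush on exit.
    try:
        for reading in generate_readings():  # hypothetical event source
            stream.ingest_record(reading)
    finally:
        stream.close()

if __name__ == "__main__":
    main()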

Logging from the Zerobus client gives a clear view of how ingestion progressed, along with some metrics:

[Screenshot: client logs showing stream progress and ingestion metrics]

We can see the ingested data in the Delta table:

[Screenshot: ingested data in the target Delta table]

Once data lands in the target table, it can be processed using Spark declarative pipelines in continuous mode to propagate changes downstream.
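
As an illustration, a minimal declarative pipeline that reads the Zerobus target table as a stream might look like this; the table names are hypothetical.

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleansed readings streamed from the Zerobus target table")
def sensor_readings_clean():
    return (
        spark.readStream.table("main.iot.sensor_readings")  # hypothetical Zerobus target
        .where(col("temperature").isNotNull())
    )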

Example Zerobus Ingest client implementation on GitHub: anujsen18/databricks_zerobus_client_example

 

Limitations and Pricing Note

Zerobus Ingest is currently in public preview and has defined throughput limits, with optimal performance when the client and endpoint run in the same region. It supports up to 100 MB/s or ~15,000 rows per second per stream and provides at-least-once delivery guarantees, requiring downstream handling of potential duplicates. While Zerobus usage is free during preview, Databricks plans to introduce pricing in the future, which should be considered for production planning.
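
Because delivery is at-least-once, downstream consumers should be prepared to drop duplicates. One common pattern is watermarked deduplication on a unique event key; the column names here are hypothetical, and event_time is assumed to be a timestamp column.

# Drop duplicate at-least-once deliveries by a unique event key,
# keeping state bounded with a watermark on the event timestamp.
deduped = (
    spark.readStream.table("main.iot.sensor_readings")  # hypothetical target table
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["event_id", "event_time"])
)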

Conclusion

Zerobus Ingest simplifies real-time data ingestion in Databricks by allowing events to be written directly into Delta tables without relying on a traditional message broker. This approach reduces operational complexity, minimises latency, and enables faster analytics for event-driven and IoT workloads.

While Zerobus is ideal for direct event ingestion with Databricks as the primary destination, other patterns such as Kafka, Auto Loader, or CDC pipelines remain better suited for complex stream processing, file-based ingestion, or database replication. Selecting the right ingestion pattern depends on workload requirements, latency needs, and operational considerations.