Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Vicky_Bukta_DB
Databricks Employee

The Challenge: From IoT to Insights in Real-Time


Picture this: It's 3 AM. Twenty sailboats are racing through rough Caribbean seas in the middle of a four-day regatta. Each boat has sensors that transmit GPS coordinates, wind speed, course heading, and performance metrics every second. That's 1.7 million data points per day.

There are real-time demands: Race officials need to see who's leading right now. Spectators on shore want to watch their favorite boats navigate around marks in real-time. After the race, race teams want to understand which strategies worked best. Was it better strategically to tack or maintain speed over longer distances?


Notebook reviewing the performance of the boats based on distance sailed, wind conditions, and point of sail. The greater the velocity made good (VMG), the faster the boat is going to its destination.
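VMG can be computed from boat speed and the angle between the boat's course and the bearing to its destination. Here is a minimal sketch of that formula (our own illustration, not the notebook's actual code):

```python
import math

def velocity_made_good(speed_knots: float, heading_deg: float,
                       bearing_to_mark_deg: float) -> float:
    """VMG: the component of boat speed in the direction of the destination."""
    angle = math.radians(heading_deg - bearing_to_mark_deg)
    return speed_knots * math.cos(angle)

# A boat doing 8 knots, sailing 45 degrees off the direct line to the mark:
print(round(velocity_made_good(8.0, 45.0, 0.0), 2))  # 5.66
```

A boat pointing straight at the mark converts all of its speed into VMG; a boat tacking 45 degrees off the line converts only about 71% of it, which is exactly the trade-off the post-race analysis explores.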

Here's the traditional stack you'd need to support this challenge:

  • Apache Kafka for streaming telemetry.
  • Schema Registry for managing evolving data formats.
  • Stream processor (e.g., Apache Flink, Spark Streaming) for transformations and ingestion of our Kafka data as it arrives.
  • Database for powering a front-end leaderboard and real-time tracking.
  • Custom web application with its own auth, scaling, and monitoring.
  • Data orchestrator and analytical query engine for post-race analysis.

Each layer adds latency, operational complexity, and failure modes. You're looking at weeks of integration work before you write a single line of business logic.

But what if there was a better way? What if you could push data directly to your lakehouse, deploy interactive apps in minutes, and run analytics on the same platform—all fully managed with unified governance?

Here’s how we built exactly that using Zerobus Ingest, Databricks Asset Bundles (DABs), and Databricks Apps.

Exhibit A: Data Drifter Regatta

We built Data Drifter Regatta, a sailing simulator that demonstrates how Databricks’ modern data tools eliminate the complexity of real-time analytics. The application simulates a fleet of sailboats generating telemetry data (e.g., GPS coordinates, wind conditions, speed, VMG) and visualizes their race progress in real-time through an interactive web application.

The entire stack—data ingestion, storage, processing, analytics, and visualization—runs on Databricks, orchestrated by a single deployment command.

Let's walk through how three key Databricks tools made this possible.

1. Lakeflow Connect with Zerobus Ingest: Rethinking Streaming

What Is Zerobus Ingest? (And What It Isn't)

Zerobus Ingest is a serverless, push-based ingestion API that writes data directly into Unity Catalog Delta tables. This endpoint is explicitly designed for high-throughput streaming writes to your lakehouse.

  • Over 10GB per second throughput to a table.
  • Up to 100MB per second throughput per connection.
  • Built for highly concurrent workloads.
  • Scale with connections, not partitions.

Zerobus Ingest is not a message bus, so there is no Kafka to run: no topics to publish to, no partitions to scale, no consumers to manage, and no backfills to schedule.

When your ultimate destination is the Lakehouse, Zerobus Ingest empowers you to eliminate the message bus, paving the way for a streamlined, single-hop architecture that dramatically reduces costs, complexity, and maintenance.


How Zerobus Ingest Works

Zerobus Ingest handles all the complexity of streaming writes to the lakehouse: no jobs to tune, no message brokers to scale, and no infrastructure to manage.


When you send data to Zerobus Ingest, you:

  1. Open a stream to the target table: Authenticate with the Zerobus Ingest endpoint using OAuth and specify the target Unity Catalog table.
  2. Send data: Call ingest_record() or ingest_batch() using the SDK, or make a REST POST call.
  3. Receive an acknowledgment: An acknowledgment signals that your data is durable.

Zerobus Ingest handles the rest, including:

  1. Schema validation: Deserializes and validates that the message can be written to the target table.
  2. Buffering: Buffers messages internally.
  3. Landing data in Unity Catalog: Ensures that data is queryable within seconds and governed by Unity Catalog.

Serverless and Horizontally Scalable

Zerobus Ingest is fundamentally different from traditional streaming platforms, where scaling means:

  • Adding brokers
  • Scaling partitions
  • Managing consumer groups
  • Handling backpressure

With Zerobus Ingest, your "scaling strategy" is to simply open more connections.

In the Data Drifter Regatta demo, a single connection pushes 20 records/second (all boats in one stream). You could easily scale this out to:

  • Distributed producers: 20 connections each pushing 1 record/second (one per boat).
  • Enterprise scale: Thousands of devices each pushing their own data.

All these scenarios require zero infrastructure changes. Zerobus Ingest automatically scales to handle the incoming connections. You don't configure partitions, and you don't manage brokers. You simply push data. Joby Aviation, an early Zerobus Ingest adopter, transformed its data infrastructure—streaming gigabytes per minute from thousands of devices directly to the lakehouse, all with sub-five-second delivery. Read the case study.
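The distributed-producer pattern is just concurrent connections. Here is a runnable sketch using asyncio; `FakeStream` is a stand-in for the real SDK stream (in production, each producer would call `sdk.create_stream(...)` and `stream.ingest_record(...)` as shown in the code section below):

```python
import asyncio

# Stand-in for a Zerobus stream; the real SDK stream exposes ingest_record().
class FakeStream:
    def __init__(self):
        self.sent = 0

    async def ingest_record(self, record):
        self.sent += 1

async def boat_producer(boat_id: str, stream: FakeStream, n_records: int):
    # One connection per boat, each pushing its own telemetry.
    for i in range(n_records):
        await stream.ingest_record({"boat_id": boat_id, "seq": i})

async def run_fleet(n_boats: int, records_per_boat: int) -> int:
    # "Scaling strategy": open more connections, one per producer.
    streams = [FakeStream() for _ in range(n_boats)]
    await asyncio.gather(*(
        boat_producer(f"boat-{i:02d}", s, records_per_boat)
        for i, s in enumerate(streams)
    ))
    return sum(s.sent for s in streams)

print(asyncio.run(run_fleet(20, 5)))  # 100
```

Going from one shared stream to twenty independent ones is a change to the producer code only; nothing on the Zerobus side needs reconfiguring.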

Multi-Protocol Flexibility: REST API for IoT Devices

Zerobus Ingest supports multiple ingestion protocols, giving you the flexibility to choose the right approach for your use case:

  • SDK with gRPC: High-performance, bi-directional streaming for application-to-application communication (what we showed above)
  • REST API: Maximum flexibility for IoT devices, edge sensors, and simpler integrations where you can't or don't want to use an SDK


Zerobus Ingest’s multi-protocol approach provides the flexibility to support our diverse data producers.

Support for both gRPC and REST APIs in IoT and edge deployments offers flexibility for various device capabilities, with the REST API easily integrating with any device that can make HTTP requests. Both protocols ensure reliable performance and automatic schema management, making them ideal for resource-constrained devices and legacy systems.

Choose what fits your architecture and device constraints.

gRPC with SDK: From Boat Sensor to the Lakehouse

Let's look at the code we used to ingest sailboat telemetry:

from zerobus.sdk.aio import ZerobusSdk
from zerobus.sdk.shared import StreamDestination

# Initialize the SDK with an OAuth service principal
sdk = ZerobusSdk(
    server_endpoint="your-endpoint.zerobus.databricks.com",
    client_id=OAUTH_CLIENT_ID,
    client_secret=OAUTH_CLIENT_SECRET
)


# Create a stream targeting your Unity Catalog table
stream = await sdk.create_stream(
    destination=StreamDestination(
        table_name="main.sailboat.telemetry"
    )
)



# Generate telemetry data for each boat
for boat in fleet.boats:
    telemetry = {
        "boat_id": boat.id,
        "boat_name": boat.name,
        "timestamp": int(time.time() * 1000000),  # microseconds
        "latitude": boat.latitude,
        "longitude": boat.longitude,
        "speed_over_ground_knots": boat.speed,
        "wind_speed_knots": wind.speed,
        "heading_degrees": boat.heading,
        # ... 20+ more fields
    }

    # Push to Zerobus - that's it!
    await stream.ingest_record(telemetry)

That's the complete ingestion pipeline: 29 fields of nested telemetry data, streaming at 20 records/second, landing in Delta Lake—with only 3 function calls!

REST API: Weather Station IoT Sensor Ingestion 

For this Data Drifter demo, we added a weather station that simulates an IoT device, reporting wind conditions via the Zerobus Ingest REST API. The weather station only emits data when conditions change significantly—demonstrating how smart devices can implement their own logic before sending data:

import requests

# Weather station prepares data from sensors
weather_data = [{
    "station_id": "race-course-01",
    "timestamp": int(time.time() * 1_000_000),
    "wind_speed_knots": 15.3,
    "wind_direction_degrees": 45.0,
    "event_type": "frontal_passage",
    "in_transition": True,
    # ... additional weather metrics
}]

# Send to Zerobus REST API
response = requests.post(
    f"https://{endpoint}/zerobus/v1/tables/{table}/insert",
    json=weather_data,
    headers={
        "Authorization": f"Bearer {oauth_token}"
    }
)
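The REST example assumes you already hold an OAuth bearer token. With a Databricks service principal, tokens are typically obtained via the machine-to-machine client_credentials flow; the endpoint path and scope below are our assumption based on Databricks OAuth M2M conventions, so verify them against your workspace:

```python
def build_token_request(workspace_host: str):
    """Build the token request for Databricks machine-to-machine OAuth.

    Assumes the standard client_credentials flow; verify the endpoint and
    scope against your workspace's OAuth configuration.
    """
    url = f"https://{workspace_host}/oidc/v1/token"
    data = {
        "grant_type": "client_credentials",
        "scope": "all-apis",
    }
    return url, data

url, data = build_token_request("your-workspace.cloud.databricks.com")
print(url)

# The actual call would then look roughly like:
#     response = requests.post(url, data=data,
#                              auth=(OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET))
#     oauth_token = response.json()["access_token"]
```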

Zerobus Ingest is Fully Managed

Behind a simple API call, Zerobus Ingest is doing the heavy lifting and handling key operational tasks automatically:

1. Schema Management

  • Infers schema from the record.
  • Gracefully handles differences between stream and table schema.
  • Validates incoming data against the current schema to prevent the ingestion of “bad” data.
  • Manages type conversions (timestamps, decimals, arrays, variant type).
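Both telemetry examples above encode timestamps as microsecond epoch integers. A small helper (our own, not part of the SDK) makes that conversion explicit:

```python
from datetime import datetime, timezone

def to_epoch_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to the microsecond epoch
    integer used in the telemetry payloads."""
    return int(dt.timestamp() * 1_000_000)

print(to_epoch_micros(datetime(2025, 1, 1, tzinfo=timezone.utc)))
# 1735689600000000
```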

2. Optimization

  • Managed serverless infrastructure.
  • Manages commit frequency to balance latency vs. throughput.
  • Supports z-ordering, liquid clustering, and partitioning for query performance.
  • With predictive optimization integration, files are analyzed, compacted, and clustered asynchronously.

3. Reliability

  • Provides at-least-once delivery semantics.
  • Provides automatic retries with backoff.

4. Security

  • Offers OAuth with Databricks service principals.
  • Enforces Unity Catalog governance on writes.

All of this is made possible by being integrated into the Databricks platform, with Unity Catalog at the center.

Near Real-Time Performance: What We Observed

Here are the results of our testing with Data Drifter Regatta:

  • Records per second: 20 (one per boat)
  • Average latency: < 500ms
  • Data size: ~2KB per record (29 fields)
  • Throughput: ~40KB/sec continuous
  • Infrastructure managed: 0 servers, 0 clusters, 0 configs
  • Code complexity: 3 function calls

When we scaled to 100 boats (100 records/second), latency stayed flat. When we added batch ingestion (100 records per API call), throughput increased 10x with the same code.
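That 10x throughput gain came from grouping records before sending. Here is a sketch of the batching logic; the `ingest_batch` call in the comment follows the SDK method named earlier, but treat it as illustrative:

```python
from itertools import islice

def chunked(records, batch_size=100):
    """Yield successive batches of records for batch ingestion."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch

telemetry = [{"boat_id": i % 20, "seq": i} for i in range(250)]
sizes = [len(b) for b in chunked(telemetry, 100)]
print(sizes)  # [100, 100, 50]

# With the Zerobus SDK, the send loop would then look roughly like:
#     for batch in chunked(telemetry, 100):
#         await stream.ingest_batch(batch)
```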

The Developer Experience Difference

What we did:

  • Call ingest_record() with a Python dict
  • Write SQL queries against Unity Catalog

What we didn't have to do:

  • Set up Kafka brokers
  • Configure partitions and replication
  • Write schema registry definitions
  • Build Spark Streaming jobs
  • Tune batch intervals
  • Monitor consumer lag
  • Handle small file compaction
  • Set up dead letter queues
  • Configure backpressure handling

The difference is transformative. We spent our time building race visualization and analytics, not debugging streaming infrastructure.

Key Takeaways: Why Zerobus Ingest Matters

  1. It's an API, not infrastructure: No clusters to manage; just API calls.
  2. Serverless scaling: Handles 10 records/sec or 10,000/sec transparently.
  3. Built on the Lakehouse: Writes directly to Delta; no separate storage layer.
  4. Fully integrated with Unity Catalog: Unified governance and lineage.
  5. Developer-first: If you can call an API, you can stream data to your Lakehouse.

For our sailboat use case, Zerobus Ingest allowed us to focus on the interesting problem—simulating realistic racing dynamics—rather than managing streaming infrastructure. That's the promise of serverless done right.

2. Databricks Apps: From Query to Interactive UI in Minutes

Once we had streaming data landing in our Delta tables, we needed a way to visualize it in real-time. We leveraged Databricks Apps to query the incoming data and power a live race visualization and leaderboard.


What Are Databricks Apps?

Databricks Apps let you deploy interactive web applications directly in your Databricks workspace. The app runs on managed compute with zero infrastructure setup. Apps work as part of the unified Databricks platform, making it seamless to connect and access key assets such as our storage and SQL compute.


Databricks Apps can utilize your SQL compute to run queries against your storage, leveraging a single copy of the data.

Our Implementation

We built our race tracker using Streamlit, with an interactive map powered by Folium:

from databricks import sql  # databricks-sql-connector
import folium.plugins

# Connect to the SQL warehouse (host, path, and token elided)
conn = sql.connect(
    server_hostname=DATABRICKS_HOST,
    http_path=WAREHOUSE_HTTP_PATH,
    access_token=ACCESS_TOKEN,
)

# Query Unity Catalog directly: the last 5 minutes of telemetry
query = """
SELECT boat_id, boat_name, latitude, longitude,
       speed_over_ground_knots, heading_degrees,
       distance_to_destination_nm, race_status
FROM main.default.sailboat_telemetry
WHERE timestamp > (
    SELECT MAX(timestamp) - 300000000  -- 5 minutes in microseconds
    FROM main.default.sailboat_telemetry
)
ORDER BY timestamp DESC
"""

# Execute with the SQL warehouse
cursor = conn.cursor()
cursor.execute(query)
df = cursor.fetchall_arrow().to_pandas()

# Visualize each boat's latest position on the interactive map
# (rows are ordered newest-first, so first() picks the latest fix per boat)
for boat_id, boat in df.groupby('boat_id').first().iterrows():
    folium.plugins.BoatMarker(
        [boat['latitude'], boat['longitude']],
        popup=f"{boat['boat_name']}: {boat['speed_over_ground_knots']:.1f} knots",
        heading=boat['heading_degrees'],
    ).add_to(race_map)

Resource Permissions

When you create an app, it automatically provisions a service principal. You can then grant that principal access to specific resources:

{
  "resources": [
    {
      "name": "sql-warehouse",
      "sql_warehouse": {
        "id": "warehouse-id",
        "permission": "CAN_USE"
      }
    }
  ]
}

The app's service principal gets automatic permissions to:

  • Query the telemetry table
  • Use the SQL warehouse
  • Access the data it needs—nothing more, nothing less
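Inside the app, the provisioned resource is typically referenced through an environment variable rather than a hard-coded ID. A sketch of that lookup, noting that the variable name `DATABRICKS_WAREHOUSE_ID` is our assumption and should match the env mapping in your app.yaml:

```python
import os

def warehouse_http_path() -> str:
    """Build the SQL warehouse HTTP path from an env var injected into the app.

    DATABRICKS_WAREHOUSE_ID is an assumed variable name; map it to your
    app's sql-warehouse resource in app.yaml.
    """
    warehouse_id = os.environ["DATABRICKS_WAREHOUSE_ID"]
    return f"/sql/1.0/warehouses/{warehouse_id}"

os.environ.setdefault("DATABRICKS_WAREHOUSE_ID", "abc123")  # demo only
print(warehouse_http_path())  # /sql/1.0/warehouses/abc123
```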

No API keys, no credential management, no security headaches. Prototyping with Databricks Apps was a breeze. In summary, Databricks Apps gave us:

  1. Zero infrastructure: No compute instances or scaling to manage yourself.
  2. Resource governance: Fine-grained permissions achieved through Unity Catalog.
  3. Live updates: Deploy changes with a single command.

3. Databricks Asset Bundles: Infra as Code, Done Right

The next challenge: post-race analysis. How do we coordinate the deployment of our pipeline and jobs? We want to understand the performance of the different boats and their progress through the races, all in the context of changing weather conditions.


Notebook report reviewing the changing wind conditions over the course of the race.

Bundles

Databricks Asset Bundles let you define your Databricks project, such as jobs, notebooks, and resources, all in a single YAML file and deploy it with one command.

Using Bundles, we can achieve:

  1. Environment parity: The same configuration for dev/staging/prod, defined in a single YAML file.
  2. Version control: Infrastructure changes are committed in Git.
  3. Atomic deployments: All resources are deployed as defined (or referenced) in the databricks.yml file.
  4. Parameterization: Variables for different environments.
  5. Dependency management: Jobs run in the correct order and block on dependencies automatically.

The project's databricks.yml file defines the Data Drifter bundle. In our case, this file sets up a job with a sequence of tasks designed to analyze the race results.

bundle:
  name: sailboat_telemetry_analysis

variables:
  config_file_path:
    default: ../config.toml
  warehouse_id:
    default: "warehouse-id-here"

resources:
  jobs:
    sailboat_analysis:
      name: "Sailboat Performance Analysis"
      tasks:
        - task_key: boat_performance
          notebook_task:
            notebook_path: ./notebooks/01_boat_performance.py
            source: WORKSPACE
        - task_key: wind_conditions
          notebook_task:
            notebook_path: ./notebooks/02_wind_conditions.py
            source: WORKSPACE
          depends_on:
            - task_key: boat_performance
        - task_key: race_progress
          notebook_task:
            notebook_path: ./notebooks/03_race_progress.py
            source: WORKSPACE
          depends_on:
            - task_key: boat_performance
        - task_key: race_summary
          notebook_task:
            notebook_path: ./notebooks/04_race_summary.py
            source: WORKSPACE
          depends_on:
            - task_key: wind_conditions
            - task_key: race_progress
targets:
  dev:
    mode: development
    workspace:
      host: https://your-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com

The result is a pipeline with a chain of dependent tasks that run before everything is compiled into a final race summary.
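Deploying and running the bundle is a couple of Databricks CLI commands. A sketch, where the target and job names follow the YAML above:

```shell
# Validate the bundle configuration
databricks bundle validate

# Deploy all resources to the dev target
databricks bundle deploy -t dev

# Trigger the analysis job defined in the bundle
databricks bundle run sailboat_analysis -t dev
```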


The Complete Architecture: How It All Fits Together

The Data Drifter Regatta architecture uses Zerobus Ingest to push real-time sailboat and weather station data to Delta Lake tables. 

This Delta open format storage layer then powers downstream processes: a Streamlit app for live race visuals, and Databricks Notebooks for post-race analysis orchestrated through DABs. 

This single-copy design lets data producers write to Delta Lake once, while multiple consumers, such as the real-time UI and batch analytics, read the same tables without separate pipelines or complex orchestration.


The Databricks Platform made building Data Drifter simple and streamlined, eliminating the need to integrate with multiple vendors or self-host open source components. 

  • Simplicity scales: Eliminated 3 separate systems (streaming platform, database, web server), infrastructure management, and complex credential management.
  • Developer-first experience: Seamless integration meant less context switching, faster iteration, fewer integration bugs, and more time focusing on business logic.
  • Governance comes free: Unity Catalog integration provides automatic lineage tracking, fine-grained access control, audit logs, and no separate IAM management.
  • Easy production deployment: The same tools that enabled quick prototyping made it easy to transition to production deployments.

The Future of Real-Time Analytics

Building Data Drifter Regatta taught us that platform unification enables application breadth.

A unified platform means:

  • No integration tax: Ingest, apps, and analytics work together out of the box.
  • Single governance model: Unity Catalog is our source of truth for governance.
  • Single deployment: Asset Bundles orchestrate the entire stack.

Across industries, converging trends underscore the need for a unified platform to accelerate development.

  • IoT explosion: Billions of devices generating streaming data.
  • Real-time expectations: Building live dashboards, not just batch reports on the same data.
  • AI/ML integration: Analytical and operational data feed AI/ML models.
  • Compliance pressure: Governance can't be “bolted” on; it must be built in.
  • Cloud costs: Running 10 services costs 10x, both in $ and complexity.

The old approach—best-of-breed tools stitched together by dedicated engineering and infrastructure teams—was a liability. Now, a unified, simplified platform lets you focus on business logic—your market differentiator—rather than managing infrastructure.

One platform, zero manual integration, infinite possibilities.

Getting Started: Try It Yourself

The complete Data Drifter Regatta code is available as a reference implementation. We combined our telemetry generator, Databricks App, and analysis pipeline into a simple deployment script to get you started. 

# 1. Clone the repository
git clone https://github.com/databricks-solutions/zerobus-ingest-examples.git
cd zerobus-ingest-examples/data_drifter

# 2. Configure: update config.toml with your credentials

# 3. Deploy everything with one command
./deploy.sh

# 4. Start generating telemetry
python main.py --client-id <client-id> --client-secret <client-secret>

Try it yourself: Clone the complete Data Drifter Regatta code and see how quickly you can go from "I have streaming data" to "I have a production application."

Additional Resources