Modern vehicles generate large streams of data that can be harnessed in a data-driven manner to boost efficiency, optimize costs, and minimize environmental impact in real time. By leveraging this data across a fleet of vehicles at scale, we can achieve even greater benefits, enabling more effective fleet management and optimization.
When it comes to building real-time streaming applications from IoT devices, including vehicle fleet telemetry and other edge devices, Protobuf (Protocol Buffers) has emerged as a popular choice for companies. Its compact, efficient binary format ensures low latency and high throughput, making it ideal for handling critical data, enabling real-time decision-making, and optimizing operations.
Traditional approaches to managing Protobuf events often place a heavy burden on Data Engineering teams, requiring specialized skills that are rare and costly. This increases the ongoing expense of maintaining proto packages, performing system upgrades, and running infrastructure. Additionally, data integrity issues during event replays add further complexity, making it challenging for businesses to manage streaming applications efficiently and cost-effectively. This highlights the need for a simple, reliable, and performant solution that minimizes the dependency on technical resources.
This blog explores how you can harness the capabilities of Serverless Delta Live Tables (DLT) pipelines to transform raw IoT telemetry data from fleets into powerful insights. With native Protobuf support, Databricks greatly simplifies the process of serializing and deserializing Protobuf events to enable these insights. Furthermore, serverless compute ensures your data teams can focus on high-value activities like extracting insights while Databricks efficiently manages compute resources, optimizing and scaling them for your workloads.
Company XYZ is a hypothetical logistics provider created to demonstrate real-time vehicle fleet analytics with Databricks Delta Live Tables. They operate thousands of vehicles across various routes daily, facing challenges in optimizing fuel efficiency and monitoring vehicle utilization. Excessive idle times and unmonitored fuel consumption can lead to higher operational costs.
To address these challenges, they leverage real-time data analytics using Databricks Delta Live Tables (DLT) to gain actionable insights from their fleet telemetry data, enabling them to optimize vehicle usage and reduce fuel expenses effectively.
IoT devices in vehicles can transmit telemetry data such as location, speed, fuel level, and engine status. These data points can power insights into fuel efficiency and vehicle utilization, which are essential metrics for optimizing fleet operations.
The high-level architecture diagram below illustrates the end-to-end implementation of a streaming solution using Kafka and the Databricks Serverless DLT pipeline.
A data producer generates fleet telemetry events into a Kafka topic in the Protobuf format. In real-world architectures, this data producer is replaced by the vehicle fleet telemetry IoT devices, which send events to a centralized system such as a gRPC server. This system packages and compacts the individual events into batch messages before transmitting them to Kafka. The Serverless DLT pipeline continuously ingests those events from the Kafka topic, deserializes them using the from_protobuf function, and populates the following tables in the medallion architecture.
These Gold materialized views are critical for downstream analytics. They provide insights into fuel efficiency and vehicle utilization, which are essential metrics for optimizing fleet operations.
Note: This blog uses Kafka to efficiently manage real-time fleet telemetry data, ensuring reliable streaming and handling of Protobuf-serialized messages.
When you build your DLT pipeline, you are presented with a visual representation of the flows and dependencies in a Directed Acyclic Graph (DAG). This helps track the flow of data from raw ingestion to the final stages of processing.
The Protobuf deserialization occurs while populating the bronze_fleet_events table, which serves as a source for performing change data capture into the silver table silver_fleet_events. Refer to the detailed sections below for each table to understand the data flow within this pipeline.
Leveraging the Serverless option for Delta Live Tables offers several key advantages, making it an ideal choice for efficiently processing Protobuf data without the overhead of managing infrastructure. It provides a simple, easy-to-use experience and builds performant, auto-optimized, and fully managed data pipelines that decrease overall TCO. You can enable Serverless DLT by navigating to the pipeline settings and ticking the "Serverless" option.
For more detailed insights on the benefits of using serverless DLT, including how it enables cost-effective, incremental ETL pipelines, refer to this latest blog: "Cost-effective, incremental ETL with serverless compute for Delta Live Tables pipelines".
In high-throughput scenarios, Protobuf messages are often transmitted in consolidated batches rather than as individual events, and the stream is prone to data repetition when edge devices replay past events.
Data is typically structured using a Protobuf schema defined in the .proto file. An example of one is shown below. The schema includes two message types: fleetevent and fleetevents.
fleet_management.proto
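The sketch below reconstructs such a schema: the two message names follow the description above, while the individual field names and types (vehicle_id, speed, fuel_level, and so on) are illustrative assumptions rather than the exact schema.

```protobuf
syntax = "proto3";

package fleet_management;

// A single telemetry reading from one vehicle.
// Field names and types are illustrative assumptions.
message fleetevent {
  string vehicle_id      = 1;
  double latitude        = 2;
  double longitude       = 3;
  double speed           = 4;  // e.g. km/h
  double fuel_level      = 5;  // e.g. percent of tank capacity
  string engine_status   = 6;  // e.g. "ON", "OFF", "IDLE"
  int64  event_timestamp = 7;  // epoch milliseconds
}

// A batch of individual events consolidated into a single Kafka message.
message fleetevents {
  repeated fleetevent events = 1;
}
```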
To optimize data transmission, individual fleet events (fleetevent) are consolidated into batches (fleetevents), meaning several events are combined into a single message and then serialized into the Protobuf format before being sent to Kafka.
A raw Protobuf payload retrieved from the Kafka topic arrives as a binary blob in the message's value column. Each payload contains a batch of multiple events, each corresponding to an individual telemetry reading from a vehicle.
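As a rough illustration, a batch read along the following lines could be used to pull a few of these raw messages and inspect the binary payload; the broker address and topic name are placeholders, and authentication options are omitted.

```python
# Illustrative only: peek at a few raw Protobuf messages from the Kafka topic.
# Broker and topic values are placeholders.
raw_sample = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "<broker-host>:9092")
    .option("subscribe", "fleet-telemetry")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
    .limit(5)
)

# The value column holds the serialized fleetevents batch as binary data.
display(raw_sample.select("key", "value", "topic", "partition", "offset", "timestamp"))
```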
The raw Protobuf data is ingested into the Bronze layer in two steps:
The Bronze raw streaming table is where raw fleet telemetry events are ingested from Kafka in Protobuf binary format, ready for further processing. In the DLT pipeline, raw events are ingested into bronze_fleet_events_raw using the Spark readStream method, which continuously ingests streaming data from the Kafka topic in real time.
Streaming tables in DLT are stateful, incrementally computed, and only process data added since the last pipeline run. The readStream query that defines the bronze_fleet_events_raw is provided in the corresponding function.
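A minimal sketch of that function is shown below, assuming a Python DLT pipeline with placeholder broker and topic values; authentication options would be added to the Kafka options map in practice.

```python
import dlt

KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "<broker-host>:9092",
    "subscribe": "fleet-telemetry",
    "startingOffsets": "earliest",
    # SASL/SSL authentication options would typically be added here,
    # with credentials pulled from Databricks Secrets.
}

@dlt.table(
    name="bronze_fleet_events_raw",
    comment="Raw fleet telemetry events ingested from Kafka in Protobuf binary format."
)
def bronze_fleet_events_raw():
    # Continuously ingest raw Kafka records; the Protobuf payload stays binary here.
    return spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
```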
Sample result: bronze_fleet_events_raw contains the raw Protobuf messages ingested from Kafka in the value column.
Deserialization converts the Protobuf binary data into readable structured data that can be processed and queried within the pipeline. The function definition of the Bronze table bronze_fleet_events consists of the Protobuf deserialization logic using the from_protobuf function.
The value column (containing the payload) and schema_registry_options are passed as parameters to the from_protobuf function, which deserializes the payload into a struct. The explode function is used here to flatten the batch messages (fleetevents) into the individual event (fleetevent) records for further processing. Kafka metadata fields such as key, partition, offset, and timestamp are retained in this table to maintain traceability from the raw Kafka data.
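A sketch of that deserialization logic is shown below. It assumes a Confluent-compatible Schema Registry, placeholder subject and address values, and the events field name from the illustrative schema sketched earlier.

```python
import dlt
from pyspark.sql.functions import col, explode
from pyspark.sql.protobuf.functions import from_protobuf

# Placeholder Schema Registry settings; add authentication options if required.
schema_registry_options = {
    "schema.registry.subject": "fleet-telemetry-value",
    "schema.registry.address": "https://<schema-registry-host>:8081",
}

@dlt.table(
    name="bronze_fleet_events",
    comment="Deserialized fleet telemetry events, one row per individual event."
)
def bronze_fleet_events():
    return (
        dlt.read_stream("bronze_fleet_events_raw")
        # Deserialize the batched payload (fleetevents) into a struct column.
        .select(
            from_protobuf(col("value"), options=schema_registry_options).alias("batch"),
            "key", "partition", "offset", "timestamp",
        )
        # Flatten the batch into individual fleetevent records.
        .select(
            explode(col("batch.events")).alias("event"),
            "key", "partition", "offset", "timestamp",
        )
        .select("event.*", "key", "partition", "offset", "timestamp")
    )
```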
Sample result: bronze_fleet_events contains the deserialized data.
The from_protobuf function by default uses FAILFAST parsing mode, which will immediately halt the pipeline and raise an error if any malformed records are encountered during the deserialization process. Refer to the Databricks documentation for more details.
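As a sketch, and assuming the mode option described in that documentation, the deserialization call could instead tolerate malformed rows (returning null rather than failing) roughly as follows, where raw_df stands for any DataFrame with a binary value column:

```python
# Hedged sketch: switch from the default FAILFAST behavior to PERMISSIVE mode,
# so malformed records yield null instead of stopping the stream.
permissive_options = {
    **schema_registry_options,
    "mode": "PERMISSIVE",
}

deserialized_df = raw_df.select(
    from_protobuf(col("value"), options=permissive_options).alias("batch")
)
```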
To use the Protobuf functions from_protobuf and to_protobuf, you must provide either a schema registry via the options argument or a descriptor file identified by the descFilePath argument.
The from_protobuf and to_protobuf functions are available on Databricks Runtime 12.2 LTS and above. The basic syntax for these functions is as follows:
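The snippet below is a sketch following the documented signatures; the subject name, registry address, message name, and descriptor file path are placeholders.

```python
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

# Option 1: resolve the schema from a Schema Registry.
schema_registry_options = {
    "schema.registry.subject": "<subject-name>",
    "schema.registry.address": "https://<schema-registry-host>:8081",
}
parsed_df = df.select(
    from_protobuf(df.value, options=schema_registry_options).alias("event")
)
binary_df = parsed_df.select(
    to_protobuf(parsed_df.event, options=schema_registry_options).alias("value")
)

# Option 2: resolve the schema from a compiled descriptor file.
parsed_df = df.select(
    from_protobuf(df.value, "fleetevents", descFilePath="/path/to/descriptor.desc").alias("event")
)
binary_df = parsed_df.select(
    to_protobuf(parsed_df.event, "fleetevents", descFilePath="/path/to/descriptor.desc").alias("value")
)
```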
It is considered best practice to use a Schema Registry, as it ensures centralized management and versioning of schemas. This provides a robust mechanism for maintaining consistency and compatibility across producers and consumers in the data pipeline. The functions simplify the serialization and deserialization processes by leveraging schemas stored in the schema registry, eliminating the need to write custom serialization and deserialization logic manually.
Sensitive information, such as credentials for Kafka brokers and the Schema Registry, can be securely managed using Databricks Secrets.
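For example, credentials could be pulled from a secret scope at pipeline start, roughly as sketched below; the scope and key names are placeholders.

```python
# Retrieve placeholder credentials from a Databricks secret scope.
kafka_username = dbutils.secrets.get(scope="fleet-demo", key="kafka-username")
kafka_password = dbutils.secrets.get(scope="fleet-demo", key="kafka-password")

# Build Kafka SASL options without hard-coding credentials in the pipeline code.
kafka_auth_options = {
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{kafka_username}" password="{kafka_password}";'
    ),
}
```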
The silver_fleet_events table handles event replays from edge devices and ensures data integrity by storing only the latest version of each event, following an SCD Type-1 approach. The apply_changes function in the DLT pipeline simplifies the change data capture (CDC) process; it requires specifying the source, target, keys, sequencing column for the change feed, and SCD type.
Since edge devices can transmit previously sent events into the Kafka topic again, the Silver table silver_fleet_events is configured as an SCD Type-1 table using the apply_changes function. When a previously sent event is ingested again into the DLT pipeline, the corresponding entry in the Silver table is updated rather than duplicated. This table excludes the Kafka-specific metadata columns (key, partition, offset, and timestamp) and focuses on business-relevant event data for further processing.
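A sketch of that configuration is shown below; the key columns (vehicle_id and event_timestamp) come from the illustrative schema above, and sequencing by the Kafka timestamp is an assumption so that replayed events overwrite earlier copies.

```python
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table(
    name="silver_fleet_events",
    comment="Latest version of each fleet telemetry event (SCD Type 1)."
)

dlt.apply_changes(
    target="silver_fleet_events",
    source="bronze_fleet_events",
    keys=["vehicle_id", "event_timestamp"],   # assumed unique identifier for an event
    sequence_by=col("timestamp"),             # Kafka timestamp orders replayed events
    except_column_list=["key", "partition", "offset", "timestamp"],  # drop Kafka metadata
    stored_as_scd_type=1,
)
```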
Sample result: silver_fleet_events is an SCD Type-1 table that ensures only the latest fleet telemetry events are stored.
Lastly, the Silver table is used to populate two Gold summary materialized views, both within the same DLT pipeline.
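The sketch below shows two hypothetical materialized views for fuel efficiency and vehicle utilization; the aggregation logic and column names are illustrative assumptions based on the schema sketched earlier.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="gold_fuel_efficiency_summary",
    comment="Hypothetical per-vehicle fuel efficiency summary."
)
def gold_fuel_efficiency_summary():
    return (
        dlt.read("silver_fleet_events")
        .groupBy("vehicle_id")
        .agg(
            F.avg("fuel_level").alias("avg_fuel_level"),
            F.avg("speed").alias("avg_speed"),
        )
    )

@dlt.table(
    name="gold_vehicle_utilization_summary",
    comment="Hypothetical per-vehicle utilization summary based on engine status."
)
def gold_vehicle_utilization_summary():
    return (
        dlt.read("silver_fleet_events")
        .groupBy("vehicle_id")
        .agg(
            F.count("*").alias("total_events"),
            F.sum(F.when(F.col("engine_status") == "IDLE", 1).otherwise(0)).alias("idle_events"),
        )
    )
```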
Sample results:
Protobuf deserialization is a crucial step in processing high-volume, real-time data streams, especially when dealing with consolidated messages from edge devices. By leveraging the native from_protobuf function within a Serverless DLT pipeline, data engineering teams can simplify the deserialization process, reduce infrastructure overhead, improve deserialization performance, and minimize maintenance efforts.
The Serverless nature of DLT further enhances this approach by eliminating the need for infrastructure management, allowing teams to focus solely on data processing and analytics. This streamlines the ingestion and processing of complex Protobuf messages and ensures that your data pipelines are scalable, cost-efficient, and easy to manage. Using declarative syntax in DLT, you can configure SCD Type 1 and 2 and other ETL needs (such as Data Quality Expectations) to build highly efficient ETL pipelines that support real-time analytics and decision-making.
As streaming data continues to play a pivotal role across industries, adopting best practices like these in your ETL pipelines will be essential for maintaining a competitive edge and optimizing operational efficiency. By following the steps outlined in this blog, you can confidently implement Protobuf deserialization in your Serverless Databricks DLT pipelines, ensuring that your streaming applications are optimized for performance, scalability, and accuracy.