Hi @Fiona
To use Protobuf with a descriptor file, the compiled descriptor must be available to your compute cluster; you then reference its path in the Protobuf functions. Here are the steps:
1. Import the necessary functions:
from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf
2. Specify the path to the descriptor file:
descriptor_file = "/path/to/proto_descriptor.desc"
3. Use from_protobuf() to deserialize a binary column into a struct:
proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))
4. Use to_protobuf() to serialize a struct column back to binary:
proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("bytes"))
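Putting the steps together, here is a minimal end-to-end sketch of a Kafka-to-Kafka round trip. The broker address, topic names, and checkpoint path are placeholders, and it assumes your descriptor was compiled from a .proto file with something like `protoc --include_imports --descriptor_set_out=proto_descriptor.desc basic_message.proto` (filenames hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

spark = SparkSession.builder.appName("protobuf-roundtrip").getOrCreate()

# Path to the compiled descriptor set, accessible from the cluster.
descriptor_file = "/path/to/proto_descriptor.desc"

# Read raw Protobuf bytes from Kafka; `value` is a binary column.
# Broker address and topic are placeholders.
input_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Deserialize the binary payload into a struct column.
# "BasicMessage" is the message name as defined in your .proto file.
proto_events_df = input_df.select(
    from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto")
)

# ... transform proto_events_df as needed ...

# Serialize the struct back to Protobuf bytes; the Kafka sink expects
# the payload in a column named "value".
proto_binary_df = proto_events_df.select(
    to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("value")
)

# Write the re-serialized bytes to an output topic (placeholders again).
query = (
    proto_binary_df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/checkpoints/protobuf-roundtrip")
    .start()
)
```

Note that from_protobuf() and to_protobuf() ship with Apache Spark 3.4+ and, per the docs linked below, Protobuf support is available in Databricks Runtime 12.1 and above.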
Sources:
- https://docs.databricks.com/structured-streaming/protocol-buffers.html