
Reading a protobuf file in a Databricks notebook

Fiona
New Contributor II

I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to read the protobuf data once the binary is already in a DataFrame. How do I get the binary data into a DataFrame in the first place?

1 ACCEPTED SOLUTION


Priyanka_Biswas
Esteemed Contributor III

Hi @Fiona 
To use Protobuf with a descriptor file, you can reference a descriptor file that is available to your compute cluster. Here are the steps:

1. Import the necessary functions:

from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf

2. Specify the path to the descriptor file:

descriptor_file = "/path/to/proto_descriptor.desc"

3. Use from_protobuf() to cast a binary column to a struct:

proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))

4. Use to_protobuf() to cast a struct column to binary:

proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("bytes"))
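
For context, the descriptor file referenced in step 2 is generated from your .proto definition with protoc, e.g. protoc --include_imports --descriptor_set_out=proto_descriptor.desc basic_message.proto (file names here are placeholders). Once decoded, the struct column from step 3 can be expanded into ordinary columns; a minimal sketch, assuming the proto_events_df above:

# Expand the decoded struct into one top-level column per Protobuf field
flat_df = proto_events_df.select("proto.*")
flat_df.printSchema()  # the schema mirrors the BasicMessage definition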

Sources:
https://docs.databricks.com/structured-streaming/protocol-buffers.html


3 REPLIES


Fiona
New Contributor II

Hi! Yeah, I think I understand everything about that, but I don't know how to create "input_df" given a file of multiple protobuf records, if that makes sense

StephanK
New Contributor II

If you have proto files in offline data storage, you should be able to read them with:

input_df = spark.read.format("binaryFile").load(data_path)
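
Note that the binaryFile source exposes each file's bytes in a column named content (alongside path, modificationTime, and length), so that is the column to pass to from_protobuf(). A minimal end-to-end sketch combining this with the accepted solution, assuming one serialized BasicMessage per file (data_path and descriptor_file as defined above):

from pyspark.sql.protobuf.functions import from_protobuf

# Each row is one file: path, modificationTime, length, content (binary)
input_df = spark.read.format("binaryFile").load(data_path)

# Decode the file bytes; the binary column is "content", not "value"
proto_events_df = input_df.select(from_protobuf(input_df.content, "BasicMessage", descFilePath=descriptor_file).alias("proto"))

If each file instead holds many length-delimited records, the bytes would first need to be split into individual messages (for example in a UDF), since from_protobuf() decodes one message per row.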

