Reading a protobuf file in a Databricks notebook

Fiona
New Contributor II

I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to decode the protobuf data once the binary is already in a DataFrame. How do I load the binary data into a DataFrame in the first place?

1 ACCEPTED SOLUTION

Priyanka_Biswas
Valued Contributor

Hi @Fiona 
To use Protobuf with a descriptor file, you can reference a file that is available on your compute cluster. Here are the steps:

1. Import the necessary functions:

from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf

2. Specify the path to the descriptor file:

descriptor_file = "/path/to/proto_descriptor.desc"

3. Use from_protobuf() to cast a binary column to a struct:

proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))

4. Use to_protobuf() to cast a struct column to binary:

proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("bytes"))
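
For reference, the descriptor file itself is generated from your .proto schema with the protobuf compiler, and the decoded struct can be flattened for inspection. A minimal sketch continuing from step 3, assuming a schema file named basic_message.proto (a hypothetical name) that defines BasicMessage:

# The descriptor file is produced by protoc (run in a shell, not in the notebook);
# --include_imports bundles any imported .proto files into the descriptor set:
#   protoc --include_imports --descriptor_set_out=/path/to/proto_descriptor.desc basic_message.proto

# Fields of BasicMessage appear as a nested struct; flatten them to inspect the schema:
proto_events_df.select("proto.*").printSchema()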

Sources:
https://docs.databricks.com/structured-streaming/protocol-buffers.html


3 REPLIES


Fiona
New Contributor II

Hi! Yeah, I think I understand everything about that, but I don't know how to create "input_df" given a file of multiple protobuf records, if that makes sense

StephanK
New Contributor II

If you have proto files in offline data storage, you should be able to read them with:

input_df = spark.read.format("binaryFile").load(data_path)
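
Putting the two replies together, a minimal end-to-end sketch; data_path, the message name BasicMessage, and the descriptor path are placeholders, and it assumes each file holds a single serialized message (files containing multiple length-delimited records would need extra parsing):

from pyspark.sql.protobuf.functions import from_protobuf

data_path = "/mnt/offline-storage/events/"  # hypothetical location of the proto files
descriptor_file = "/path/to/proto_descriptor.desc"

# binaryFile produces one row per file with columns: path, modificationTime, length, content
input_df = spark.read.format("binaryFile").load(data_path)

# Decode each file's raw bytes into a typed struct using the descriptor
proto_events_df = input_df.select(
    from_protobuf(input_df.content, "BasicMessage", descFilePath=descriptor_file).alias("proto")
)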