Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading a protobuf file in a Databricks notebook

Fiona
New Contributor II

I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to decode the protobuf data once the binary is already in a DataFrame. How do I load the binary data into a DataFrame in the first place?


3 REPLIES

Priyanka_Biswas
Valued Contributor
ACCEPTED SOLUTION

Hi @Fiona
To use Protobuf with a descriptor file, you reference a compiled descriptor file that is accessible from your compute cluster. Here are the steps:

1. Import the necessary functions:

from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf

2. Specify the path to the descriptor file:

descriptor_file = "/path/to/proto_descriptor.desc"

3. Use from_protobuf() to cast a binary column to a struct:

proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))

4. Use to_protobuf() to cast a struct column to binary:

proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("bytes"))
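
For completeness, two details the steps above take for granted (a sketch; the file and message names are placeholders, not from the original post): the .desc file is produced by compiling your .proto schema with protoc, and once decoded, the message fields behave like ordinary nested columns.

# Compile the schema into a descriptor set (run where protoc is installed, e.g. via %sh):
#   protoc --include_imports --descriptor_set_out=proto_descriptor.desc basic_message.proto

# After from_protobuf(), the decoded fields can be expanded into top-level columns:
proto_events_df.select("proto.*").show()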

Sources:
https://docs.databricks.com/structured-streaming/protocol-buffers.html

Fiona
New Contributor II

Hi! Yeah, I think I understand everything about that, but I don't know how to create "input_df" given a file of multiple protobuf records, if that makes sense

StephanK
New Contributor II

If you have proto files in offline data storage, you should be able to read them with:

input_df = spark.read.format("binaryFile").load(data_path)
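
Note that the binaryFile source yields one row per file, with the raw bytes in a column named content (alongside path, modificationTime, and length), so the from_protobuf() call from the accepted solution should target that column. A minimal sketch combining the two answers (data_path, the descriptor path, and BasicMessage are placeholders; this assumes each file holds a single serialized message, since a file that concatenates many length-delimited records would need to be split before decoding):

from pyspark.sql.protobuf.functions import from_protobuf

data_path = "/path/to/proto/files/"                 # placeholder
descriptor_file = "/path/to/proto_descriptor.desc"  # placeholder

# One row per file; the payload lands in the binary "content" column
input_df = spark.read.format("binaryFile").load(data_path)

# Decode each file's bytes as a single BasicMessage
proto_events_df = input_df.select(
    from_protobuf(input_df.content, "BasicMessage", descFilePath=descriptor_file).alias("proto")
)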

 
