07-24-2023 09:01 AM
I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to decode the protobuf data once the binary is already in a DataFrame. How do I load the binary data into a DataFrame in the first place?
07-26-2023 03:26 PM - edited 07-26-2023 03:27 PM
Hi @Fiona
To use Protobuf with a descriptor file, reference a descriptor file that is available to your compute cluster. The steps are:
1. Import the necessary functions:
from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf
2. Specify the path to the descriptor file:
descriptor_file = "/path/to/proto_descriptor.desc"
3. Use from_protobuf() to cast a binary column to a struct:
proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))
4. Use to_protobuf() to cast a struct column to binary:
proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descriptor_file).alias("bytes"))
Sources:
- https://docs.databricks.com/structured-streaming/protocol-buffers.html
07-28-2023 06:36 AM
Hi! Yeah, I think I understand all of that, but I don't know how to create "input_df" from a file that contains multiple protobuf records, if that makes sense.
09-11-2023 01:00 PM
If you have proto files in offline data storage, you should be able to read them with:
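One possible way to get from a file of several records to the rows of "input_df" is sketched below. It is only a sketch and rests on an assumption the thread does not confirm: that each record in the file is preceded by its length as a base-128 varint (the framing produced by writeDelimitedTo in the Java protobuf library). If the file holds a single serialized message, or uses a different framing, this does not apply. The names read_varint and split_delimited are hypothetical helpers, not part of any library.

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data; the high bit says whether more bytes follow.
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def split_delimited(buf: bytes) -> List[bytes]:
    """Split a buffer of varint-length-prefixed records into raw record bytes.

    ASSUMPTION: the file uses length-delimited framing. The records are
    returned undecoded, so they can go into a binary column as-is.
    """
    records = []
    pos = 0
    while pos < len(buf):
        size, pos = read_varint(buf, pos)
        end = pos + size
        if end > len(buf):
            raise ValueError("truncated record")
        records.append(buf[pos:end])
        pos = end
    return records
```

In a notebook, the bytes of the file could come from open(path, "rb").read(), and the resulting list could then be turned into a one-column DataFrame with something like spark.createDataFrame([(r,) for r in split_delimited(data)], "value binary"). That would give from_protobuf() the binary "value" column used in the steps earlier in this thread. For large files this collects everything on the driver, so it is best treated as a starting point.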