Hi @Fiona
To use Protobuf with a descriptor file, the compiled descriptor must be available to your compute cluster; you then reference its path in the Protobuf functions. Here are the steps:
1. Import the necessary functions:
from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf
2. Specify the path to the descriptor file:
descriptor_file = "/path/to/proto_descriptor.desc"
3. Use from_protobuf() to deserialize a binary column into a struct:
proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))
4. Use to_protobuf() to serialize a struct column back to binary:
proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("bytes"))
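Putting the steps together, here is a minimal end-to-end sketch of a Kafka-to-Kafka round trip. The broker address, topic names, and checkpoint path are placeholders, and it assumes your descriptor was compiled from a .proto file with something like `protoc --include_imports --descriptor_set_out=proto_descriptor.desc basic_message.proto` (filenames hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

spark = SparkSession.builder.appName("protobuf-roundtrip").getOrCreate()

# Path to the compiled descriptor set, accessible from the cluster.
descriptor_file = "/path/to/proto_descriptor.desc"

# Read raw Protobuf bytes from Kafka; `value` is a binary column.
# Broker address and topic are placeholders.
input_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Deserialize the binary payload into a struct column.
# "BasicMessage" is the message name as defined in your .proto file.
proto_events_df = input_df.select(
    from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto")
)

# ... transform proto_events_df as needed ...

# Serialize the struct back to Protobuf bytes; the Kafka sink expects
# the payload in a column named "value".
proto_binary_df = proto_events_df.select(
    to_protobuf(proto_events_df.proto, "BasicMessage", descFilePath=descriptor_file).alias("value")
)

# Write the re-serialized bytes to an output topic (placeholders again).
query = (
    proto_binary_df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/checkpoints/protobuf-roundtrip")
    .start()
)
```

Note that from_protobuf() and to_protobuf() ship with Apache Spark 3.4+ and, per the docs linked below, Protobuf support is available in Databricks Runtime 12.1 and above.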
Sources:
- https://docs.databricks.com/structured-streaming/protocol-buffers.html