07-24-2023 09:01 AM
I have proto files (offline data storage) that I'd like to read from a Databricks notebook. I found this documentation (https://docs.databricks.com/structured-streaming/protocol-buffers.html), but it only covers how to decode the protobuf data once the binary is already in a DataFrame. How do I load the binary data into a DataFrame in the first place?
07-26-2023 03:26 PM - edited 07-26-2023 03:27 PM
Hi @Fiona
To use Protobuf with a descriptor file, reference a descriptor file that is available to your compute cluster. Here are the steps:
1. Import the necessary functions:
from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf
2. Specify the path to the descriptor file:
descriptor_file = "/path/to/proto_descriptor.desc"
3. Use from_protobuf() to cast a binary column to a struct:
proto_events_df = input_df.select(from_protobuf(input_df.value, "BasicMessage", descFilePath=descriptor_file).alias("proto"))
4. Use to_protobuf() to cast a struct column to binary:
proto_binary_df = proto_events_df.select(to_protobuf(proto_events_df.proto, "BasicMessage", descriptor_file).alias("bytes"))
Sources:
- https://docs.databricks.com/structured-streaming/protocol-buffers.html
07-28-2023 06:36 AM
Hi! Yeah, I think I understand all of that, but I don't know how to create "input_df" from a file that contains multiple protobuf records, if that makes sense.
09-11-2023 01:00 PM
If you have proto files in offline data storage, you should be able to read them with:
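A minimal sketch of one way this might look, assuming each file stores its records back to back with a varint length prefix (the framing written by writeDelimitedTo in the Java protobuf library). The thread does not say how these files are framed, so that is an assumption, and the helper names read_varint and split_delimited are hypothetical. Spark's binaryFile source loads each file as a single row with a binary content column, so the records still have to be split apart before from_protobuf can decode them:

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear marks the last byte
            return result, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def split_delimited(buf: bytes) -> List[bytes]:
    """Split a buffer of varint-length-prefixed messages into raw payloads."""
    records = []
    pos = 0
    while pos < len(buf):
        size, pos = read_varint(buf, pos)
        end = pos + size
        if end > len(buf):
            raise ValueError("truncated record")
        records.append(buf[pos:end])
        pos = end
    return records


# Untested sketch of the Spark wiring in a notebook; `spark` is the notebook's
# session and the path below is a hypothetical placeholder:
#
# from pyspark.sql import functions as F
# from pyspark.sql.types import ArrayType, BinaryType
#
# split_udf = F.udf(lambda b: split_delimited(bytes(b)), ArrayType(BinaryType()))
# files_df = spark.read.format("binaryFile").load("/path/to/proto_records/")
# input_df = files_df.select(F.explode(split_udf("content")).alias("value"))
```

The Spark part is left as comments because it needs a running cluster. If it works as intended, input_df has a single binary column named value, which is the input_df.value that from_protobuf() expects in step 3 of the earlier reply. If each file holds exactly one message with no length prefix, the content column can be passed to from_protobuf() directly and the splitting step is unnecessary.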