<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Protobuf deserialization in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15347#M9679</link>
    <description>&lt;P&gt;@Jani Sourander&amp;nbsp;would you mind posting the logic that you used in your ProtoFetcher class? I am running into the same cannot-pickle issue for a protobuf pb2.py file and have attempted to re-create the ProtoFetcher class and the my_test_function, but am still receiving the error.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
    <pubDate>Fri, 01 Apr 2022 15:51:17 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2022-04-01T15:51:17Z</dc:date>
    <item>
      <title>Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15335#M9667</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let's assume I have these things:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A binary column containing protobuf-serialized data&lt;/LI&gt;&lt;LI&gt;The .proto file including the message definition&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What different approaches have Databricks users chosen to deserialize the data? Python is the programming language that I am most familiar with, so anything that can be achieved using PySpark would be great.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Locally, my approach would be to use grpc_tools.protoc to generate the pb2 files (message.proto -&amp;gt; message_pb2.py). I would later import these classes and use the appropriate message to deserialize the binary data. Example code below:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import os
import pkg_resources
from grpc_tools import protoc

# protopath, PROTO_PATH and OUTPUT_PATH are defined elsewhere in the build script

# Verbose
print(f"[Building] {protopath}")

# For some reason, grpc_tools.protoc doesn't include the _proto module,
# which causes errors such as: '"google.protobuf.Timestamp" is not defined.'
path_to_module = pkg_resources.resource_filename('grpc_tools', '_proto')

# Flags
args = (
    'grpc_tools.protoc',  # argv[0] is ignored by protoc
    f"--proto_path={PROTO_PATH}",
    f"--python_out={OUTPUT_PATH}",
    f"-I{path_to_module}",
    # f"--grpc_python_out={OUTPUT_PATH}",
    protopath.split("/")[-1]
)

protoc.main(args)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;My current ideas are:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Perform the code above on an external machine, create a package "my_message_deserializer.whl", and use it as a dependent library on the Job/Task/Cluster. This would need to be updated each time the proto file changes, e.g. via git webhooks.&lt;/LI&gt;&lt;LI&gt;Or, in Databricks, install grpcio and grpcio-tools and run similar code as above on the driver, then import the created pb2 class and use the message as usual.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is there any other way of using the deserializer with Spark? Something a bit less manual?&lt;/P&gt;</description>
      <pubDate>Wed, 15 Sep 2021 10:36:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15335#M9667</guid>
      <dc:creator>sourander</dc:creator>
      <dc:date>2021-09-15T10:36:45Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15337#M9669</link>
      <description>&lt;P&gt;@Kaniz Fatma&amp;nbsp;, the community hasn't provided answers yet. A similar question appeared in the Databricks Forums earlier, and there are no ideas there either. Do you have any ideas on how to perform this deserialization efficiently in Databricks? Can we somehow give Spark a protobuf file/message that it would use as a serde for a given column?&lt;/P&gt;</description>
      <pubDate>Thu, 30 Sep 2021 06:49:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15337#M9669</guid>
      <dc:creator>sourander</dc:creator>
      <dc:date>2021-09-30T06:49:37Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15339#M9671</link>
      <description>&lt;P&gt;@Jani Sourander​&amp;nbsp;- Following up on Kaniz's answer, we have escalated the issue to the proper team. As she said, they'll get back to you as soon as they can. &lt;/P&gt;</description>
      <pubDate>Thu, 30 Sep 2021 17:29:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15339#M9671</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-09-30T17:29:10Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15340#M9672</link>
      <description>&lt;P&gt;Hi @Jani Sourander&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I found this library, &lt;A href="https://github.com/saurfang/sparksql-protobuf" alt="https://github.com/saurfang/sparksql-protobuf" target="_blank"&gt;sparksql-protobuf&lt;/A&gt;; it might work, although it has not been updated in a while.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Mon, 11 Oct 2021 23:41:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15340#M9672</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-10-11T23:41:22Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15341#M9673</link>
      <description>&lt;P&gt;Hi @Jose Gonzalez&amp;nbsp;and thank you for the reply! That library seems out of date and lacks documentation. I wonder if ScalaPB would be a better option. I don't have any Scala experience, but since Scala UDFs perform well with Spark, I suppose learning Scala and taking this approach would be an option. There would be some other benefits too, such as writing other Databricks UDFs in Scala for a performance increase.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It is starting to seem that the best option for an MVP is actually the Python-based option, used as a wheel library in Databricks.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 06:11:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15341#M9673</guid>
      <dc:creator>sourander</dc:creator>
      <dc:date>2021-10-12T06:11:50Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15342#M9674</link>
      <description>&lt;P&gt;I would definitely go with a Python option here; that way you can use pandas UDFs, which are faster than Scala UDFs. Happy coding!&lt;/P&gt;</description>
      <pubDate>Wed, 20 Oct 2021 21:53:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15342#M9674</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-20T21:53:31Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15343#M9675</link>
      <description>&lt;P&gt;@Dan Zafar&amp;nbsp;do you have any ideas on how I could optimize a query that requires access to a class method? Details below.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I created a class ProtoFetcher that hosts various proto_pb2.py files created using grpc_tools.protoc (similarly to the original post). It can be instantiated by giving a name of the data; internally it imports the correct some_proto_pb2 module, assigns it to class variables, and uses getattr(some_proto_pb2, "name_of_this_message") to fetch the correct GeneratedProtocolMessageType (which is defined in google.protobuf.pyext.cpp_message).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sadly, I can't seem to find a way to use the UDF so that the class initialisation wouldn't be &lt;B&gt;inside&lt;/B&gt; the function. If I'm not completely wrong, this means that the class will get initialised for each row in a loop. Trying to access the method from outside the UDF definition raises a "PicklingError: Could not serialize object: TypeError: cannot pickle 'google.protobuf.pyext._message.MessageDescriptor' object".&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@udf("string")
def my_test_func(blob):
    d = ProtoFetcher("name_of_this_data")
    return d.blob_to_json(blob)

new_df = df.withColumn("blob_as_json", my_test_func("original_blob_col"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Running a display() on that new_df takes about 40 seconds, and this test file has only 600 rows. I also tried the pandas_udf approach and got similar results. I defined the pandas_udf as:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@pandas_udf("string")
def my_test_function(s: pd.Series) -&amp;gt; pd.Series:
    d = ProtoFetcher("name_of_this_data")
    s_json = s.apply(d.blob_to_json)
    return s_json&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;NOTE: The blob column has a fair amount of data, though. Running the same code using a local pandas installation takes 32 seconds on a fairly powerful laptop. This was run with the code:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df['new_col'] = df['original_col'].apply(d.blob_to_json)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 13:18:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15343#M9675</guid>
      <dc:creator>sourander</dc:creator>
      <dc:date>2021-10-28T13:18:46Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15344#M9676</link>
      <description>&lt;P&gt;Just use an Iterator of Series UDF. See here: &lt;A href="https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html#iterator-of-series-to-iterator-of-series-udf" target="_blank"&gt;https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html#iterator-of-series-to-iterator-of-series-udf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This allows you to set up some pre-defined state (like loading a file) before doing the computation.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 15:32:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15344#M9676</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-28T15:32:45Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15345#M9677</link>
      <description>&lt;P&gt;In case this helps someone else, I will leave here for reference the code that I use for running the method inside a UDF in Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def my_test_function(blobs: Iterator[pd.Series]) -&amp;gt; Iterator[pd.Series]:
    d = ProtoFetcher("my_message_name")
    for blob in blobs:
        yield blob.apply(d.blob_to_json)

new_df = df.withColumn("col_as_json", my_test_function("original_col"))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;For a small dataset, this performed just like the non-pandas UDF, but I suppose this will change when I scale the dataset up.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 05:56:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15345#M9677</guid>
      <dc:creator>sourander</dc:creator>
      <dc:date>2021-10-29T05:56:37Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15346#M9678</link>
      <description>&lt;P&gt;Thanks @Jani Sourander&amp;nbsp;, you can mark this as 'Best Answer' so that the question is resolved and the answer is easy to find for future users.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 21:41:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15346#M9678</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-10-29T21:41:07Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15347#M9679</link>
      <description>&lt;P&gt;@Jani Sourander&amp;nbsp;would you mind posting the logic that you used in your ProtoFetcher class? I am running into the same cannot-pickle issue for a protobuf pb2.py file and have attempted to re-create the ProtoFetcher class and the my_test_function, but am still receiving the error.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Fri, 01 Apr 2022 15:51:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15347#M9679</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-04-01T15:51:17Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15348#M9680</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The ProtoFetcher is just a wrapper for the pb2 file(s) created using protoc. As of now, I am using Scala to do the same trick (it was about 5x faster).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To avoid the pickling problems, you need to build the ProtoFetcher library into a wheel file on a non-Databricks machine (e.g. your laptop). I personally prefer Python Poetry for managing libraries and the build process. Then upload this wheel to S3 and use it as a cluster library in Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The class itself is simply a wrapper for the pb2 files:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import json

from google.protobuf.pyext.cpp_message import GeneratedProtocolMessageType
from google.protobuf.json_format import MessageToJson
from ProtoFetcher.PROTO_OUT import example_message_pb2


# Hard-coded list of implemented messages. A single proto can contain many useful messages.
known_messages = {
    "example_message": {
        "proto_source": "example_message.proto",
        "class": example_message_pb2,
        "proto_messagename": "ExampleMessageList",
        "proto_sub_messagename": "ExampleMessageValues",
    },
}


class ProtoFetcher:
    """A class to represent a proto message. A wrapper for google.protobuf.

    Parameters
    ----------
    messagename : str
        The common name of the service.

    Examples
    --------
    &amp;gt;&amp;gt;&amp;gt; service = ProtoFetcher('example_message')
    &amp;gt;&amp;gt;&amp;gt; serde = service.get_deserializer()
    &amp;gt;&amp;gt;&amp;gt; 'some_field_that_is_part_of_the_proto_message' in serde.DESCRIPTOR.fields_by_name.keys()
    True
    """

    def __init__(self, messagename: str):
        """Constructs the proto message wrapper."""

        # Input check
        if messagename not in known_messages:
            raise NameError(f"The service must be one of: {list(known_messages)}")

        # Settings/info for this messagename from the config
        self.msg_info = known_messages[messagename]

        # Initialize the message
        self.deserializer = self._init_deserializer_class()

    def _init_deserializer_class(self) -&amp;gt; GeneratedProtocolMessageType:
        """Gets the protocol message type from the Protobuf descriptors (_pb2.py file).

        It can be used for serializing and deserializing the messages.
        """
        # Get the top-level message type
        message = getattr(self.msg_info["class"], self.msg_info["proto_messagename"])

        # Some messages include a submessage. This happens at least when
        # a blob field description is a list of lists.
        if self.msg_info["proto_sub_messagename"]:
            message = getattr(message, self.msg_info["proto_sub_messagename"])

        return message

    def get_deserializer(self) -&amp;gt; GeneratedProtocolMessageType:
        """Getter for the message type that can be used for serializing and deserializing the messages."""
        return self.deserializer

    def blob_to_json(self, blob) -&amp;gt; str:
        """Converts the bytes-like BLOB object to a human-readable form.

        Parameters
        ----------
        blob : str/bytes
            Bytes-like object or ASCII string including the contents of the blob field.

        Returns
        -------
        j : str
            Stringified JSON containing the deserialized message data.
        """

        # Deserialize the proto
        deserializer = self.get_deserializer()
        message = deserializer.FromString(blob)

        # Convert to JSON and clean it
        my_message_as_json = MessageToJson(message)
        j = json.dumps(json.loads(my_message_as_json))
        return j&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Apr 2022 05:41:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/15348#M9680</guid>
      <dc:creator>sourander</dc:creator>
      <dc:date>2022-04-04T05:41:53Z</dc:date>
    </item>
    <item>
      <title>Re: Protobuf deserialization in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/43026#M27471</link>
      <description>&lt;P&gt;We've now added a native connector that parses Protobuf directly with Spark DataFrames.&amp;nbsp;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/structured-streaming/protocol-buffers.html" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/structured-streaming/protocol-buffers.html&lt;/A&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.protobuf.functions import to_protobuf, from_protobuf

schema_registry_options = {
  "schema.registry.subject" : "app-events-value",
  "schema.registry.address" : "https://schema-registry:8081/"
}

# Convert binary Protobuf to SQL struct with from_protobuf():
proto_events_df = (
  input_df
    .select(
      from_protobuf("proto_bytes", options = schema_registry_options)
        .alias("proto_event")
    )
)

# Convert SQL struct to binary Protobuf with to_protobuf():
protobuf_binary_df = (
  proto_events_df
    .selectExpr("struct(name, id, context) as event")
    .select(
      to_protobuf("event", options = schema_registry_options)
        .alias("proto_bytes")
    )
)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 31 Aug 2023 23:12:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/protobuf-deserialization-in-databricks/m-p/43026#M27471</guid>
      <dc:creator>Amou</dc:creator>
      <dc:date>2023-08-31T23:12:58Z</dc:date>
    </item>
  </channel>
</rss>

