<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How are Struct type columns stored/accessed (interested in efficiency)? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/58349#M9078</link>
    <description>&lt;P&gt;Hello, I've searched around for a while and didn't find a similar question here or elsewhere, so I thought I'd ask...&lt;BR /&gt;&lt;BR /&gt;I'm assessing the storage/access efficiency of Struct type columns in Delta tables.&amp;nbsp; I want to know more about how Databricks stores Struct type fields.&amp;nbsp; Can an SME add some details?&lt;/P&gt;&lt;P&gt;Example question I'm looking at:&amp;nbsp; Suppose I add an int field with low cardinality to a Struct column... in a columnar database this would be stored/accessed efficiently, I believe... so would it also be stored/accessed efficiently as a field in a Struct column?&lt;/P&gt;&lt;P&gt;Note: I did find a Databricks page describing (maybe) how Apache Arrow is used in Databricks Runtime 14+ (link below), but it referenced use in UDFs... I am using Structs in vanilla Delta tables and figured that was significantly different.&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&lt;A href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35#:~:text=In%20Apache%20Spark%203.5%20and,columnar%20in%2Dmemory%20data%20representation." target="_self"&gt;https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35#:~:text=In%20Apache%20Spark%203.5%20and,columnar%20in%2Dmemory%20data%20representation.&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 24 Jan 2024 17:59:12 GMT</pubDate>
    <dc:creator>crowley</dc:creator>
    <dc:date>2024-01-24T17:59:12Z</dc:date>
    <item>
      <title>How are Struct type columns stored/accessed (interested in efficiency)?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/58349#M9078</link>
      <description>&lt;P&gt;Hello, I've searched around for a while and didn't find a similar question here or elsewhere, so I thought I'd ask...&lt;BR /&gt;&lt;BR /&gt;I'm assessing the storage/access efficiency of Struct type columns in Delta tables.&amp;nbsp; I want to know more about how Databricks stores Struct type fields.&amp;nbsp; Can an SME add some details?&lt;/P&gt;&lt;P&gt;Example question I'm looking at:&amp;nbsp; Suppose I add an int field with low cardinality to a Struct column... in a columnar database this would be stored/accessed efficiently, I believe... so would it also be stored/accessed efficiently as a field in a Struct column?&lt;/P&gt;&lt;P&gt;Note: I did find a Databricks page describing (maybe) how Apache Arrow is used in Databricks Runtime 14+ (link below), but it referenced use in UDFs... I am using Structs in vanilla Delta tables and figured that was significantly different.&lt;/P&gt;&lt;P&gt;-&amp;nbsp;&lt;A href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35#:~:text=In%20Apache%20Spark%203.5%20and,columnar%20in%2Dmemory%20data%20representation." target="_self"&gt;https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35#:~:text=In%20Apache%20Spark%203.5%20and,columnar%20in%2Dmemory%20data%20representation.&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jan 2024 17:59:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/58349#M9078</guid>
      <dc:creator>crowley</dc:creator>
      <dc:date>2024-01-24T17:59:12Z</dc:date>
    </item>
    <item>
      <title>Re: How are Struct type columns stored/accessed (interested in efficiency)?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/100148#M9080</link>
      <description>&lt;P data-unlink="true"&gt;Delta Lake uses Apache Parquet as the underlying format for its data files.&lt;/P&gt;
&lt;P data-unlink="true"&gt;Spark structs are encoded as Parquet &lt;A href="https://github.com/apache/parquet-format/blob/apache-parquet-format-2.10.0/src/main/thrift/parquet.thrift#L414" target="_self"&gt;SchemaElements&lt;/A&gt;,&amp;nbsp;which are simply wrappers around standard types. What this means is that storage and access characteristics should be identical when interacting with, taking your example, an integer column at the top level of a schema versus an integer field inside of a struct - things like encoding and compression are identical with these two fields.&lt;/P&gt;
&lt;P data-unlink="true"&gt;You can use tools like PyArrow to do a deeper dive into how data is encoded in Parquet, here is some sample code that reads the Parquet footer and returns it in a human readable format:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;import pyarrow.parquet as pq

file_path = "/your/path/here/file.zstd.parquet"
parquet_file = pq.ParquetFile(file_path)
schema = parquet_file.schema
schema&lt;/LI-CODE&gt;
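&lt;P&gt;To illustrate the point above that a struct is only a wrapper around standard types, here is a minimal toy sketch in plain Python. It does not use Parquet or PyArrow, and the function name leaf_columns and the example schema are made up for illustration. It flattens a nested schema into dotted leaf paths, which is how Parquet addresses the columns it actually stores, so an int inside a struct ends up as a primitive leaf just like a top-level int:&lt;/P&gt;

```python
# Toy model only (an assumption for illustration, not the Parquet library):
# a schema is a dict whose values are either a primitive type name (a leaf)
# or another dict (a struct, i.e. a group node).
def leaf_columns(schema, prefix=""):
    """Return (dotted_path, primitive_type) pairs, one per leaf column."""
    leaves = []
    for name, node in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(node, dict):
            # A struct holds no data of its own; recurse into its fields.
            leaves.extend(leaf_columns(node, path))
        else:
            leaves.append((path, node))
    return leaves


# Hypothetical table: one top-level int and one struct that contains an int.
example_schema = {
    "status": "int32",
    "details": {"status": "int32", "label": "string"},
}

for path, primitive in leaf_columns(example_schema):
    print(path, primitive)
```

&lt;P&gt;Both status and details.status come out as int32 leaves. For a real file, the schema read from the footer by the PyArrow snippet above shows the corresponding nesting.&lt;/P&gt;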
</description>
      <pubDate>Tue, 26 Nov 2024 18:43:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/100148#M9080</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2024-11-26T18:43:16Z</dc:date>
    </item>
    <item>
      <title>Re: How are Struct type columns stored/accessed (interested in efficiency)?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/111160#M9081</link>
      <description>&lt;P&gt;Thank you very much for the thoughtful response.&amp;nbsp; Please excuse my belated feedback, and thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 25 Feb 2025 20:25:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-are-struct-type-columns-stored-accessed-interested-in/m-p/111160#M9081</guid>
      <dc:creator>crowley</dc:creator>
      <dc:date>2025-02-25T20:25:45Z</dc:date>
    </item>
  </channel>
</rss>

