<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: When should I use a bloom filter index vs. a z-order index?  Any best practices around it? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/when-should-i-use-a-bloom-filter-index-vs-a-z-order-index-any/m-p/15243#M9599</link>
    <description>&lt;P&gt;Just like B-tree indices in the traditional EDW world, Z-order indexing can be used on   high-cardinality columns like Primary Key columns and high-cardinality joins  like facts and  dimension tables joins. Z-order indexes can be created only on the first 32 columns of a table. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/spark/2.x/spark-sql/language-manual/create-bloomfilter-index.html" alt="https://docs.databricks.com/spark/2.x/spark-sql/language-manual/create-bloomfilter-index.html" target="_blank"&gt;Bloom filter index&lt;/A&gt; - can be used where z-order index cannot be used ( meaning columns on wide tables with more than 32 columns- for e.g. columns at the end of the tables like MM,YYYY, datetime, quarter etc. &lt;/LI&gt;&lt;LI&gt;Bloom filters are also be used  for less-used low or high cardinality fields that users typically use in WHERE conditions.  ( You don't want to use too many columns in one z-ordex index as its efficiency may reduce). &lt;/LI&gt;&lt;LI&gt;Bloom filter index can be used as a mechanism to avoid over-partitioning. In Databricks, we don’t recommend over-partition - each partition should have around 1 GB data (compressed) at least, In such cases, when the table is not too big and you don't want to partition,  putting a bloom filter index on date or yyyy-mm columns - can achieve the benefits of faster queries when a predicate is put on the date columns.  Similarly, if partitioning is done on the Year column then a bloom filter on Month-Year can help. &lt;/LI&gt;&lt;LI&gt;Even when a column has a z-order Index, we can put an additional&amp;nbsp; bloom filter index&amp;nbsp; on it for better performance. &lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Thu, 16 Sep 2021 00:31:16 GMT</pubDate>
    <dc:creator>User16826992724</dc:creator>
    <dc:date>2021-09-16T00:31:16Z</dc:date>
    <item>
      <title>When should I use a bloom filter index vs. a z-order index?  Any best practices around it?</title>
      <link>https://community.databricks.com/t5/data-engineering/when-should-i-use-a-bloom-filter-index-vs-a-z-order-index-any/m-p/15242#M9598</link>
      <description />
      <pubDate>Thu, 16 Sep 2021 00:18:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/when-should-i-use-a-bloom-filter-index-vs-a-z-order-index-any/m-p/15242#M9598</guid>
      <dc:creator>User16826992724</dc:creator>
      <dc:date>2021-09-16T00:18:50Z</dc:date>
    </item>
    <item>
      <title>Re: When should I use a bloom filter index vs. a z-order index?  Any best practices around it?</title>
      <link>https://community.databricks.com/t5/data-engineering/when-should-i-use-a-bloom-filter-index-vs-a-z-order-index-any/m-p/15243#M9599</link>
      <description>&lt;P&gt;Just like B-tree indices in the traditional EDW world, Z-order indexing can be used on   high-cardinality columns like Primary Key columns and high-cardinality joins  like facts and  dimension tables joins. Z-order indexes can be created only on the first 32 columns of a table. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;A href="https://docs.databricks.com/spark/2.x/spark-sql/language-manual/create-bloomfilter-index.html" alt="https://docs.databricks.com/spark/2.x/spark-sql/language-manual/create-bloomfilter-index.html" target="_blank"&gt;Bloom filter index&lt;/A&gt; - can be used where z-order index cannot be used ( meaning columns on wide tables with more than 32 columns- for e.g. columns at the end of the tables like MM,YYYY, datetime, quarter etc. &lt;/LI&gt;&lt;LI&gt;Bloom filters are also be used  for less-used low or high cardinality fields that users typically use in WHERE conditions.  ( You don't want to use too many columns in one z-ordex index as its efficiency may reduce). &lt;/LI&gt;&lt;LI&gt;Bloom filter index can be used as a mechanism to avoid over-partitioning. In Databricks, we don’t recommend over-partition - each partition should have around 1 GB data (compressed) at least, In such cases, when the table is not too big and you don't want to partition,  putting a bloom filter index on date or yyyy-mm columns - can achieve the benefits of faster queries when a predicate is put on the date columns.  Similarly, if partitioning is done on the Year column then a bloom filter on Month-Year can help. &lt;/LI&gt;&lt;LI&gt;Even when a column has a z-order Index, we can put an additional&amp;nbsp; bloom filter index&amp;nbsp; on it for better performance. &lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 16 Sep 2021 00:31:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/when-should-i-use-a-bloom-filter-index-vs-a-z-order-index-any/m-p/15243#M9599</guid>
      <dc:creator>User16826992724</dc:creator>
      <dc:date>2021-09-16T00:31:16Z</dc:date>
    </item>
  </channel>
</rss>

