Just like B-tree indices in the traditional EDW world, Z-order indexing can be used on high-cardinality columns like Primary Key columns and high-cardinality joins like facts and dimension tables joins. Z-order indexes can be created only on the first 32 columns of a table.
- Bloom filter index - can be used where z-order index cannot be used ( meaning columns on wide tables with more than 32 columns- for e.g. columns at the end of the tables like MM,YYYY, datetime, quarter etc.
- Bloom filters are also be used for less-used low or high cardinality fields that users typically use in WHERE conditions. ( You don't want to use too many columns in one z-ordex index as its efficiency may reduce).
- Bloom filter index can be used as a mechanism to avoid over-partitioning. In Databricks, we donโt recommend over-partition - each partition should have around 1 GB data (compressed) at least, In such cases, when the table is not too big and you don't want to partition, putting a bloom filter index on date or yyyy-mm columns - can achieve the benefits of faster queries when a predicate is put on the date columns. Similarly, if partitioning is done on the Year column then a bloom filter on Month-Year can help.
- Even when a column has a z-order Index, we can put an additional bloom filter index on it for better performance.