What exactly is Z Ordering and Bloom Filter?

hello_world
Databricks Partner

Have gone through the documentation, still cannot understand it.

How is bloom filter indexing a column different from z ordering a column?

Can somebody explain to me what exactly happens while these two techniques are applied?

daniel_sahal
Databricks MVP

Bloom filter is like a looking for a needle in the haystack (with FPP), so it's more useful for strings.

Z-Order is best with a couple of columns that are used for filters/joins.

They can run independently of each other or work together.

See example here:

https://www.mssqltips.com/sqlservertip/6968/bloom-filter-indexes-using-databricks-delta/

In the example, bloom filter is also used for filters.

How do we decide the columns to be indexed and z ordered? Based on data type String or Non-String?

Rishabh-Pandey
Databricks MVP

hey @Daniel Sahal​ 

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the 

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Rishabh Pandey

View solution in original post