cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

What exactly is Z Ordering and Bloom Filter?

hello_world
New Contributor III

Have gone through the documentation, still cannot understand it.

How is bloom filter indexing a column different from z ordering a column?

Can somebody explain to me what exactly happens while these two techniques are applied?

1 ACCEPTED SOLUTION

Accepted Solutions

Rishabh264
Honored Contributor II

hey @Daniel Sahal​ 

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the 

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

View solution in original post

3 REPLIES 3

daniel_sahal
Honored Contributor III

Bloom filter is like a looking for a needle in the haystack (with FPP), so it's more useful for strings.

Z-Order is best with a couple of columns that are used for filters/joins.

They can run independently of each other or work together.

See example here:

https://www.mssqltips.com/sqlservertip/6968/bloom-filter-indexes-using-databricks-delta/

In the example, bloom filter is also used for filters.

How do we decide the columns to be indexed and z ordered? Based on data type String or Non-String?

Rishabh264
Honored Contributor II

hey @Daniel Sahal​ 

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the 

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.