cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

What exactly is Z Ordering and Bloom Filter?

hello_world
New Contributor III

Have gone through the documentation, still cannot understand it.

How is bloom filter indexing a column different from z ordering a column?

Can somebody explain to me what exactly happens while these two techniques are applied?

1 ACCEPTED SOLUTION

Accepted Solutions

Rishabh-Pandey
Esteemed Contributor

hey @Daniel Sahal​ 

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the 

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Rishabh Pandey

View solution in original post

3 REPLIES 3

daniel_sahal
Esteemed Contributor

Bloom filter is like a looking for a needle in the haystack (with FPP), so it's more useful for strings.

Z-Order is best with a couple of columns that are used for filters/joins.

They can run independently of each other or work together.

See example here:

https://www.mssqltips.com/sqlservertip/6968/bloom-filter-indexes-using-databricks-delta/

In the example, bloom filter is also used for filters.

How do we decide the columns to be indexed and z ordered? Based on data type String or Non-String?

Rishabh-Pandey
Esteemed Contributor

hey @Daniel Sahal​ 

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the 

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Rishabh Pandey

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group