Databricks Community

hello_world · ‎12-28-2022

Have gone through the documentation, still cannot understand it.

How is bloom filter indexing a column different from z ordering a column?

Can somebody explain to me what exactly happens while these two techniques are applied?

Rishabh-Pandey · ‎12-29-2022

hey @Daniel Sahal

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Rishabh Pandey

View solution in original post

daniel_sahal · ‎12-28-2022

Bloom filter is like a looking for a needle in the haystack (with FPP), so it's more useful for strings.

Z-Order is best with a couple of columns that are used for filters/joins.

They can run independently of each other or work together.

See example here:

https://www.mssqltips.com/sqlservertip/6968/bloom-filter-indexes-using-databricks-delta/

hello_world · ‎12-29-2022

In the example, bloom filter is also used for filters.

How do we decide the columns to be indexed and z ordered? Based on data type String or Non-String?

Rishabh-Pandey · ‎12-29-2022

hey @Daniel Sahal

1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text

refer this code snipet to create bloom filter

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
 
options
  OPTIONS ( { key1 [ = ] val1 } [, ...] )

2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the

ZORDER BY clause:

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Rishabh Pandey

Databricks Community

What exactly is Z Ordering and Bloom Filter?

Photos

Join Us as a Local Community Builder!

Exciting Opportunity to Collaborate with Us!

Intelligent Data Warehousing: AI/BI for Self-service Analytics

Share Your Thoughts on Databricks & Get Rewarded!

Get Started With Lakehouse Architecture | Pass a quiz to earn your certificate completion.

Virtual Learning Festival: 9 April - 30 April