12-28-2022 04:05 PM
Have gone through the documentation, still cannot understand it.
How is bloom filter indexing a column different from z ordering a column?
Can somebody explain to me what exactly happens while these two techniques are applied?
12-29-2022 12:28 AM
hey @Daniel Sahal
1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text
refer this code snipet to create bloom filter
CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
options
OPTIONS ( { key1 [ = ] val1 } [, ...] )
2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the
ZORDER BY clause:
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
12-28-2022 10:28 PM
Bloom filter is like a looking for a needle in the haystack (with FPP), so it's more useful for strings.
Z-Order is best with a couple of columns that are used for filters/joins.
They can run independently of each other or work together.
See example here:
https://www.mssqltips.com/sqlservertip/6968/bloom-filter-indexes-using-databricks-delta/
12-29-2022 02:12 AM
In the example, bloom filter is also used for filters.
How do we decide the columns to be indexed and z ordered? Based on data type String or Non-String?
12-29-2022 12:28 AM
hey @Daniel Sahal
1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text
refer this code snipet to create bloom filter
CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
options
OPTIONS ( { key1 [ = ] val1 } [, ...] )
2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the
ZORDER BY clause:
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group