- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ12-28-2022 04:05 PM
Have gone through the documentation, still cannot understand it.
How is bloom filter indexing a column different from z ordering a column?
Can somebody explain to me what exactly happens while these two techniques are applied?
- Labels:
-
Bloom Filter
-
Optimization
-
Zorder
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ12-29-2022 12:28 AM
hey @Daniel Sahalโ
1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text
refer this code snipet to create bloom filter
CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
options
OPTIONS ( { key1 [ = ] val1 } [, ...] )
2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the
ZORDER BY clause:
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ12-28-2022 10:28 PM
Bloom filter is like a looking for a needle in the haystack (with FPP), so it's more useful for strings.
Z-Order is best with a couple of columns that are used for filters/joins.
They can run independently of each other or work together.
See example here:
https://www.mssqltips.com/sqlservertip/6968/bloom-filter-indexes-using-databricks-delta/
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ12-29-2022 02:12 AM
In the example, bloom filter is also used for filters.
How do we decide the columns to be indexed and z ordered? Based on data type String or Non-String?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ12-29-2022 12:28 AM
hey @Daniel Sahalโ
1-A Bloomfilter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text
refer this code snipet to create bloom filter
CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS( { columnName1 [ options ] } [, ...] ) ]
[ options ]
options
OPTIONS ( { key1 [ = ] val1 } [, ...] )
2-Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the
ZORDER BY clause:
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

