When should I create a Bloom Filter Index on my Delta table?

User16826992666
Databricks Employee
 

Ryan_Chynoweth
Databricks Employee

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. For a given value, the Bloom filter reports either that the value is definitely not in the file or that it is probably in the file, with a defined false positive probability (FPP).
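To make the "definitely not" versus "probably" behavior concrete, here is a minimal sketch of a toy Bloom filter in plain Python, with one filter built per file. It is only an illustration and not how Delta Lake implements its index: the `BloomFilter` class, the file names, and the ids are all made up, and the sizing uses the standard textbook formulas for a target FPP.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter for illustration; not Delta Lake's implementation."""

    def __init__(self, num_items: int, fpp: float):
        # Textbook sizing for an expected item count and a target FPP.
        self.num_bits = max(1, math.ceil(-num_items * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / num_items * math.log(2)))
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive one bit position per hash by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical data files, each holding 1,000 made-up ids.
files = {
    f"part-{n}.parquet": [f"id-{n}-{i}" for i in range(1000)]
    for n in range(5)
}

# Build one filter per file, as a per-file index would.
filters = {}
for name, values in files.items():
    bf = BloomFilter(num_items=len(values), fpp=0.01)
    for v in values:
        bf.add(v)
    filters[name] = bf

# Only files whose filter answers "probably present" need to be read.
needle = "id-3-42"
files_to_scan = [name for name, bf in filters.items() if bf.might_contain(needle)]
print(files_to_scan)
```

The file that really holds the value is always in `files_to_scan`, because a Bloom filter never produces false negatives. Other files can occasionally appear as false positives at roughly the configured FPP, which costs an unnecessary read but never a wrong result.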

The biggest reason to use a Bloom filter index is that you often query on a specific set of columns. A typical use case is a large table from which you query only a small subset of the data: the index helps with these “needle in a haystack” queries because files that cannot contain the value are skipped.
