<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Achieved 87% Query Performance Improvement with Custom Zonemap Indexing in Data Governance</title>
    <link>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129850#M2592</link>
    <description>&lt;P class=""&gt;Thanks, glad you found it helpful.&lt;/P&gt;&lt;P class=""&gt;Your detailed Bloom filter implementation above is fantastic—especially the RDD-based grouping approach for file-level filters. Have you measured the memory overhead in production?&lt;/P&gt;&lt;P class=""&gt;I'm curious about your experience with false positive rates. We're currently at 0.01 but considering 0.001 for our high-value queries. What's worked best for you?&lt;/P&gt;&lt;P class=""&gt;Looking forward to more technical exchanges&lt;/P&gt;</description>
    <pubDate>Tue, 26 Aug 2025 20:00:20 GMT</pubDate>
    <dc:creator>ck7007</dc:creator>
    <dc:date>2025-08-26T20:00:20Z</dc:date>
    <item>
      <title>Achieved 87% Query Performance Improvement with Custom Zonemap Indexing</title>
      <link>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129845#M2589</link>
      <description>&lt;P class=""&gt;&lt;STRONG&gt;Problem:&lt;/STRONG&gt; Queries on our 100M+ record Iceberg tables were taking 45+ seconds.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Solution:&lt;/STRONG&gt; Implemented lightweight zonemap indexing that tracks min/max values per file.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Quick Implementation&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;def apply_zonemap_pruning(table_path, predicate_value):&lt;BR /&gt;# Load zonemap index&lt;BR /&gt;zonemap = spark.read.parquet(f"{table_path}/_zonemaps")&lt;BR /&gt;&lt;BR /&gt;# Filter files based on min/max values&lt;BR /&gt;relevant_files = zonemap.filter(&lt;BR /&gt;(zonemap.min_value &amp;lt;= predicate_value) &amp;amp;&lt;BR /&gt;(zonemap.max_value &amp;gt;= predicate_value)&lt;BR /&gt;).select("file_path").collect()&lt;BR /&gt;&lt;BR /&gt;# Read only relevant files instead of a full table scan&lt;BR /&gt;return spark.read.parquet(*[f.file_path for f in relevant_files])&lt;BR /&gt;&lt;STRONG&gt;Results&lt;/STRONG&gt;&lt;/P&gt;&lt;H2&gt;Results&lt;/H2&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Before:&lt;/STRONG&gt; 42.3 seconds (scanning 1000 files)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;After:&lt;/STRONG&gt; 5.4 seconds (scanning 12 files)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cost savings:&lt;/STRONG&gt; 87% reduction in compute&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;Key insight:&lt;/STRONG&gt; Most queries only need 1-2% of files. Zonemap helps identify them instantly.&lt;/P&gt;&lt;P class=""&gt;We are currently adding Bloom filters for even better performance. Has anyone tried similar indexing strategies?&lt;/P&gt;&lt;P class=""&gt;Would love to hear your approaches!&lt;/P&gt;</description>
      <pubDate>Tue, 26 Aug 2025 17:39:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129845#M2589</guid>
      <dc:creator>ck7007</dc:creator>
      <dc:date>2025-08-26T17:39:38Z</dc:date>
    </item>
    <item>
      <title>Re: Achieved 87% Query Performance Improvement with Custom Zonemap Indexing</title>
      <link>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129848#M2590</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/180185"&gt;@ck7007&lt;/a&gt;,&lt;BR /&gt;That’s a great optimization! You can also extend zonemap pruning to multiple predicates. For example, combine a date range with a categorical filter:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Expanded example: range + extra column
relevant_files = zonemap.filter(
(zonemap.min_date &amp;lt;= query_end) &amp;amp;
(zonemap.max_date &amp;gt;= query_start) &amp;amp;
(zonemap.region == query_region)
).select("file_path").collect()&lt;/LI-CODE&gt;&lt;P&gt;Only files overlapping the date range and matching the region are read—making pruning even more selective.&lt;/P&gt;&lt;P&gt;On top of that, Bloom Filters nicely complement zonemaps for point lookups (exact keys). Below is a minimal, file-level Bloom workflow:&lt;/P&gt;&lt;P&gt;1) Build a Bloom index per file&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col, lit
from pyspark.sql import Row

# pseudo helper: create_bloom(rows, column) -&amp;gt; returns a (file_path, bloom_bytes) iterator
# You can implement this with a library (e.g., pybloom) or a JVM Bloom impl via UDF/Scala.

def build_bloom_index(table_path, column="user_id"):
# Read the Iceberg table (or Parquet paths) so you can group by file
df = spark.read.format("iceberg").load(table_path)

# Make sure df has a column with the file path; if not, produce it from metadata
# Many engines expose input_file_name(); if needed:
df_with_file = df.withColumn("_file_path", spark.functions.input_file_name())

# Build one Bloom filter per file
bloom_index_rdd = (
df_with_file
.select("_file_path", column)
.rdd
.groupBy(lambda r: r["_file_path"]) # group rows by file
.map(lambda kv: (
kv[0], # file_path
create_bloom((row[column] for row in kv[1]), column) # your custom bloom builder -&amp;gt; bytes
))
.map(lambda t: Row(file_path=t[0], bloom_bytes=t[1]))
)

spark.createDataFrame(bloom_index_rdd) \
.write.mode("overwrite").parquet(f"{table_path}/_bloom")&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;2) Use the Bloom index for pruning&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# pseudo helper: might_contain(bloom_bytes, key) -&amp;gt; bool
@udf(BooleanType())
def bloom_might_contain(bloom_bytes, key):
return bloom_lookup(bloom_bytes, key) # implement based on your bloom format

def apply_bloom_pruning(table_path, lookup_id):
# Load bloom index
bloom = spark.read.parquet(f"{table_path}/_bloom") # columns: file_path, bloom_bytes

# Keep only files that MIGHT contain the key (may include false positives, never false negatives)
candidate_files = (bloom
.filter(bloom_might_contain(bloom["bloom_bytes"], lit(lookup_id)))
.select("file_path")
.collect()
)

# Read only those candidate files
return spark.read.parquet(*[r.file_path for r in candidate_files])&lt;/LI-CODE&gt;&lt;P&gt;Why both?&lt;BR /&gt;Zonemap → best for ranges (e.g., date/time, monotonically increasing IDs).&lt;BR /&gt;Bloom Filter → best for point lookups (e.g., user_id = 123456).&lt;/P&gt;&lt;P&gt;Used together, they drastically shrink the scan set—zonemap narrows by ranges, Bloom narrows by exact keys.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Aug 2025 18:27:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129848#M2590</guid>
      <dc:creator>WiliamRosa</dc:creator>
      <dc:date>2025-08-26T18:27:16Z</dc:date>
    </item>
    <item>
      <title>Re: Achieved 87% Query Performance Improvement with Custom Zonemap Indexing</title>
      <link>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129849#M2591</link>
      <description>&lt;P&gt;tks for sharing&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/180185"&gt;@ck7007&lt;/a&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Aug 2025 18:43:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129849#M2591</guid>
      <dc:creator>WiliamRosa</dc:creator>
      <dc:date>2025-08-26T18:43:00Z</dc:date>
    </item>
    <item>
      <title>Re: Achieved 87% Query Performance Improvement with Custom Zonemap Indexing</title>
      <link>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129850#M2592</link>
      <description>&lt;P class=""&gt;Thanks, glad you found it helpful.&lt;/P&gt;&lt;P class=""&gt;Your detailed Bloom filter implementation above is fantastic—especially the RDD-based grouping approach for file-level filters. Have you measured the memory overhead in production?&lt;/P&gt;&lt;P class=""&gt;I'm curious about your experience with false positive rates. We're currently at 0.01 but considering 0.001 for our high-value queries. What's worked best for you?&lt;/P&gt;&lt;P class=""&gt;Looking forward to more technical exchanges&lt;/P&gt;</description>
      <pubDate>Tue, 26 Aug 2025 20:00:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/129850#M2592</guid>
      <dc:creator>ck7007</dc:creator>
      <dc:date>2025-08-26T20:00:20Z</dc:date>
    </item>
    <item>
      <title>Re: Achieved 87% Query Performance Improvement with Custom Zonemap Indexing</title>
      <link>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/131177#M2601</link>
      <description>&lt;P class=""&gt;Hi!&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/180185"&gt;@ck7007&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179612"&gt;@WiliamRosa&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;I have a question — why are you actually doing this? I’m not fully familiar with your exact setup (Iceberg), but my understanding is that Iceberg already stores these stats (min/max) in the manifests, and Spark should be able to leverage them to skip files that don’t contain relevant data.&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;Could you share a bit more detail? For example:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;Which engine version are you using?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;Did you check the Spark physical plan to verify whether the manifest pruning is being applied correctly?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;Is it possible that the query or engine isn’t pushing down the filter as expected?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;As I understand it, the manifest tree should already allow you to only read the necessary files without doing a full scan. So I’m not sure if this is an engine limitation, a query issue, or something else I’m missing.&lt;/P&gt;&lt;P class=""&gt;If you could provide more context, that would be really valuable for everyone here &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; Thanks!&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Sun, 07 Sep 2025 19:44:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/achieved-87-query-performance-improvement-with-custom-zonemap/m-p/131177#M2601</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-09-07T19:44:34Z</dc:date>
    </item>
  </channel>
</rss>

