
Streaming Solution

ck7007
New Contributor II

Maintain Zonemaps with Streaming Writes 

Challenge: Streaming breaks zonemaps due to constant micro-batches.

Solution: Incremental Updates
from datetime import datetime
from pyspark.sql import functions as F

def write_streaming_with_zonemap(stream_df, table_path):
    def update_zonemap(batch_df, batch_id):
        # Write the micro-batch data
        batch_df.write.format("iceberg").mode("append").save(table_path)

        # Calculate this batch's zonemap stats
        stats = batch_df.agg(
            F.min("col").alias("min"),
            F.max("col").alias("max")
        ).collect()[0]

        # Append to the incremental zonemap (not the master)
        zonemap_entry = {
            'batch_id': batch_id,
            'min': stats['min'],
            'max': stats['max'],
            'timestamp': datetime.now()
        }
        append_to_incremental_zonemap(zonemap_entry)  # helper sketched below

        # Merge incremental entries into the master every 100 batches
        if batch_id % 100 == 0:
            merge_incremental_to_master()  # helper sketched below

    return stream_df.writeStream \
        .foreachBatch(update_zonemap) \
        .trigger(processingTime="10 seconds") \
        .start()
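
The post doesn't show the two helpers, so here is a minimal sketch of what append_to_incremental_zonemap and merge_incremental_to_master could look like, assuming the incremental and master zonemaps are kept in two small Delta tables (the table names and this storage choice are illustrative assumptions, not part of the original post):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

INCREMENTAL_TBL = "zonemap_incremental"  # assumed table name
MASTER_TBL = "zonemap_master"            # assumed table name

def append_to_incremental_zonemap(entry):
    # A one-row append to a side table is cheap and uncontended,
    # unlike rewriting a single master zonemap on every micro-batch.
    spark.createDataFrame([Row(**entry)]).write.format("delta") \
        .mode("append").saveAsTable(INCREMENTAL_TBL)

def merge_incremental_to_master():
    # Fold the accumulated incremental entries into the master zonemap,
    # then clear the incremental table for the next 100 batches.
    spark.table(INCREMENTAL_TBL).write.format("delta") \
        .mode("append").saveAsTable(MASTER_TBL)
    spark.sql(f"TRUNCATE TABLE {INCREMENTAL_TBL}")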

Results

  • Streaming throughput: 43K records/sec (only 4% overhead)
  • Query performance: 23.4s → 2.1s (91% improvement)

Critical lesson: Never update the master zonemap on every batch!

Working on ML-driven index selection next. Interested in collaboration?

1 REPLY

ManojkMohan
Contributor III

@ck7007 Yes, I am interested in collaborating. I am structuring the problem like below:

The challenge is: how can we leverage the query-performance benefits of zonemaps without sacrificing the ingestion performance of a streaming pipeline?

Problem Statement: The Streaming Indexing Dilemma

In large-scale data systems, zonemaps are a vital optimization tool. They store metadata, typically the minimum and maximum values, for columns within each data file. When a query is executed (e.g., SELECT * FROM table WHERE col > 100), the query engine first consults the zonemap. If a file's zonemap indicates its maximum value for col is 90, the engine knows it can skip reading that entire file, drastically reducing I/O and improving query speed.
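
The pruning check itself is trivial once the min/max metadata exists. A self-contained toy sketch of the decision the engine makes for WHERE col > 100 (file names and values are made up):

# Toy zonemap: per-file min/max for column "col" (illustrative values).
zonemap = {
    "file_001.parquet": {"min": 10, "max": 90},
    "file_002.parquet": {"min": 95, "max": 250},
}

def files_to_scan(threshold):
    # For WHERE col > threshold, a file can be skipped whenever its
    # max is <= threshold: no row in it can satisfy the predicate.
    return [f for f, s in zonemap.items() if s["max"] > threshold]

print(files_to_scan(100))  # ['file_002.parquet'] -- file_001 is skipped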

The problem arises with streaming data:

  • High-Frequency, Small Writes: Streaming jobs, like the one using trigger(processingTime="10 seconds"), write data in frequent, small "micro-batches." This results in the creation of many small data files.
  • Metadata Bottleneck: If the system tried to update a single, "master" zonemap for the entire table with every micro-batch, the metadata update would become a severe bottleneck. This is a classic high-contention problem: many concurrent writers trying to update a single, centralized resource. The cost of locking and updating the master index would overwhelm the cost of the actual data write, destroying the throughput of the streaming pipeline. If you agree with this thought process, we can brainstorm potential solution options; some back-of-the-envelope numbers follow below.
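
To put numbers on the contention argument: with a 10-second trigger, the pipeline commits 86,400 / 10 = 8,640 micro-batches per day. Updating the master zonemap on every batch would mean 8,640 contended metadata writes per day, while the merge-every-100-batches policy above reduces that to roughly 86 master updates plus 8,640 cheap, append-only incremental writes.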
