@ck7007 Yes, I'm interested in collaborating. I'm structuring the problem like this:
The challenge is: how can we leverage the query performance benefits of zonemaps without sacrificing the ingestion performance of a streaming pipeline?
Problem Statement: The Streaming Indexing Dilemma
In large-scale data systems, zonemaps are a vital optimization tool. They store metadata (typically the minimum and maximum values) for columns within each data file. When a query is executed (e.g., SELECT * FROM table WHERE col > 100), the query engine first consults the zonemap. If a file's zonemap indicates its maximum value for col is 90, the engine knows it can skip reading that entire file, drastically reducing I/O and improving query speed.
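To make the skipping logic concrete, here's a minimal sketch of zonemap-based file pruning. The `ZoneMapEntry` structure and the file names are hypothetical, just to illustrate the min/max check the paragraph describes:

```python
# Minimal sketch of zonemap-based file skipping (hypothetical structures).
from dataclasses import dataclass
from typing import List

@dataclass
class ZoneMapEntry:
    file_path: str
    col_min: int
    col_max: int

def files_to_scan(zonemap: List[ZoneMapEntry], predicate_gt: int) -> List[str]:
    """Return only the files whose [min, max] range could satisfy col > predicate_gt."""
    return [e.file_path for e in zonemap if e.col_max > predicate_gt]

zonemap = [
    ZoneMapEntry("part-000.parquet", 10, 90),   # max 90 -> skipped for col > 100
    ZoneMapEntry("part-001.parquet", 50, 250),  # may contain matches -> scanned
]
print(files_to_scan(zonemap, 100))  # -> ['part-001.parquet']
```

The engine never opens `part-000.parquet` at all, which is where the I/O savings come from.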
The problem arises with streaming data:
- High-Frequency, Small Writes: Streaming jobs, like the one using trigger(processingTime="10 seconds"), write data in frequent, small "micro-batches." This results in the creation of many small data files.
- Metadata Bottleneck: If the system tried to update a single "master" zonemap for the entire table with every micro-batch, the metadata update would become a severe bottleneck. This is a classic high-contention problem: many concurrent writers all trying to update one centralized resource. The cost of locking and updating the master index would overwhelm the cost of the actual data write, destroying the throughput of the streaming pipeline.

If you agree with this thought process, we can brainstorm potential solution options.
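One direction worth discussing is sidestepping the master-index lock entirely: each micro-batch commit writes a small per-file zonemap "sidecar" next to its data file, and the query side merges those sidecars at read time. This is only a sketch under assumed file layouts (the `.zonemap` suffix, JSON data files, and both function names are hypothetical), not a production design:

```python
# Hypothetical sketch: each micro-batch writes its own small zonemap
# sidecar instead of taking a lock on a shared master index.
import glob
import json
import os
import tempfile

def commit_micro_batch(table_dir: str, batch_id: int, col_values: list) -> None:
    """Write the batch's data file plus a per-file zonemap sidecar."""
    data_path = os.path.join(table_dir, f"part-{batch_id:05d}.json")
    with open(data_path, "w") as f:
        json.dump(col_values, f)
    # Per-file metadata: no shared lock, so concurrent writers never contend.
    with open(data_path + ".zonemap", "w") as f:
        json.dump({"file": data_path,
                   "min": min(col_values),
                   "max": max(col_values)}, f)

def prune(table_dir: str, predicate_gt: int) -> list:
    """Query side: merge the small sidecars at read time to pick files to scan."""
    keep = []
    for meta_path in sorted(glob.glob(os.path.join(table_dir, "*.zonemap"))):
        with open(meta_path) as f:
            entry = json.load(f)
        if entry["max"] > predicate_gt:
            keep.append(entry["file"])
    return keep

table_dir = tempfile.mkdtemp()
commit_micro_batch(table_dir, 0, [10, 42, 90])    # max 90 -> prunable for col > 100
commit_micro_batch(table_dir, 1, [50, 120, 250])  # may match -> must be scanned
print(prune(table_dir, 100))
```

The trade-off it illustrates: writes stay contention-free, but the query side now pays to read many tiny sidecars, which is exactly the tension (plus compaction strategies for the metadata itself) we could brainstorm around.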