07-21-2025 02:10 PM
Hello all
Has anyone attempted to look at the internals of predictive optimization and built an in-house solution mimicking its functionality? I understand that Databricks has no plans to roll out this feature for external tables, so we were thinking of gathering the telemetry of frequently used columns ourselves and using that information for liquid clustering and stats collection....
On the other hand, if Databricks can open source it, that would be really helpful...
07-21-2025 02:27 PM
You're touching on a really interesting area! While Databricks hasn't open-sourced predictive optimization,
there have been some community efforts and approaches to build similar functionality:
Community Efforts:
Some teams have built DIY solutions using Spark query logs and custom listeners
Focus on liquid clustering column selection and automated stats collection
No full open-source clone exists yet
Common Approaches:
Parse Spark History Server logs for column usage patterns
Custom EventListeners to capture query telemetry
Heuristic-based optimization scheduling
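As a rough sketch of the first two approaches above: the idea is to scan captured query text for columns that appear in filter and join clauses, and tally usage. The regex, log format, and column names here are purely illustrative; real query logs would need a proper SQL parser.

```python
import re
from collections import Counter

def count_column_usage(queries, known_columns):
    """Count how often each known column appears in WHERE / JOIN-ON /
    GROUP BY clauses of the given query strings. Illustrative only:
    a regex is no substitute for real SQL parsing."""
    usage = Counter()
    # Capture the text after WHERE / ON / GROUP BY up to the next clause or end.
    clause_re = re.compile(
        r"\b(?:WHERE|ON|GROUP BY)\b(.*?)(?:\b(?:GROUP BY|ORDER BY|LIMIT)\b|$)",
        re.IGNORECASE | re.DOTALL,
    )
    for q in queries:
        for clause in clause_re.findall(q):
            for col in known_columns:
                if re.search(rf"\b{re.escape(col)}\b", clause, re.IGNORECASE):
                    usage[col] += 1
    return usage

# Hypothetical queries, e.g. harvested from the Spark History Server.
queries = [
    "SELECT * FROM sales WHERE region = 'EU' AND order_date > '2025-01-01'",
    "SELECT customer_id FROM sales WHERE order_date = '2025-07-01'",
]
# order_date is filtered in both queries, region in one,
# customer_id only appears in a SELECT list and is not counted.
print(count_column_usage(queries, ["region", "order_date", "customer_id"]))
```

The resulting counts can then feed clustering-column selection and stats scheduling heuristics.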
Reality Check:
Targeted solutions (clustering hints, stats automation) are feasible
Full predictive optimization replication is complex
Databricks hasn't indicated plans to open-source it
Bottom Line: Build incrementally - start with query pattern analysis for liquid clustering decisions, then expand based on ROI.
07-21-2025 10:48 PM - edited 07-21-2025 10:55 PM
Hi @noorbasha534 ,
That's a really cool idea and definitely shows initiative - but realistically, it might not be worth the effort. There's a lot of engineering going on under the hood that would be tough to replicate in-house.
Collecting telemetry and using it for things like liquid clustering and stats gathering could work to some extent, but the effort required to build and maintain something similar would likely outweigh the benefits, especially given how deeply integrated and optimized the native solution is.
If you have external tables, I would just take care of regular table maintenance (e.g., running OPTIMIZE and VACUUM regularly).
Would be awesome if Databricks open-sourced it, though - totally agree with you there.
07-21-2025 11:53 PM
@szymon_dybczak since liquid clustering only allows up to 4 columns for now, I think I can just go with the primary keys here. In our case, we have wide tables with 300+ columns, and users query columns that are not in the first 32 positions (for which we gather stats), so the stats gathering is not really helping us.
07-22-2025 06:58 AM
Hi @noorbasha534,
If you use DBR 13.3+, you can specify columns for which you would like to collect statistics with delta.dataSkippingStatsColumns
https://learn.microsoft.com/en-us/azure/databricks/delta/data-skipping
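To illustrate: `delta.dataSkippingStatsColumns` takes a comma-separated column list, so you can point stats collection at the columns users actually filter on instead of the first 32 positions. The table and column names below are hypothetical; per the linked docs, existing statistics may need recomputing after changing the property.

```python
def stats_columns_ddl(table, columns):
    """Build the ALTER TABLE statement that restricts Delta statistics
    collection to an explicit column list via the documented
    delta.dataSkippingStatsColumns table property."""
    col_list = ",".join(columns)
    return (f"ALTER TABLE {table} SET TBLPROPERTIES "
            f"('delta.dataSkippingStatsColumns' = '{col_list}')")

print(stats_columns_ddl("sales", ["order_date", "region"]))
```

This pairs naturally with the telemetry idea earlier in the thread: feed the most-filtered columns into both clustering and stats configuration.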
07-22-2025 04:48 AM
@LinlinH thanks for the details. Can you please share any GitHub link where the community work is published, so I can check whether any code can be reused...