07-21-2025 02:10 PM
Hello all
Has anyone attempted to look at the internals of predictive optimization and built an in-house solution mimicking its functionality? I understand that Databricks has no plans to roll out this feature for external tables, so we were thinking of gathering the telemetry of frequently used columns ourselves and using that information for liquid clustering and stats collection....
On the other hand, if Databricks can open source it, that would be really helpful...
07-21-2025 02:27 PM
You're touching on a really interesting area! While Databricks hasn't open-sourced predictive optimization,
there have been some community efforts and approaches to build similar functionality:
Community Efforts:
Some teams have built DIY solutions using Spark query logs and custom listeners
Focus on liquid clustering column selection and automated stats collection
No full open-source clone exists yet
Common Approaches:
Parse Spark History Server logs for column usage patterns
Custom EventListeners to capture query telemetry
Heuristic-based optimization scheduling
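As a rough sketch of the first two approaches above: the idea is to scan captured query text for columns that appear in filter and join clauses, and tally usage. The regex, log format, and column names here are purely illustrative; real query logs would need a proper SQL parser.

```python
import re
from collections import Counter

def count_column_usage(queries, known_columns):
    """Count how often each known column appears in WHERE / JOIN-ON /
    GROUP BY clauses of the given query strings. Illustrative only:
    a regex is no substitute for real SQL parsing."""
    usage = Counter()
    # Capture the text after WHERE / ON / GROUP BY up to the next clause or end.
    clause_re = re.compile(
        r"\b(?:WHERE|ON|GROUP BY)\b(.*?)(?:\b(?:GROUP BY|ORDER BY|LIMIT)\b|$)",
        re.IGNORECASE | re.DOTALL,
    )
    for q in queries:
        for clause in clause_re.findall(q):
            for col in known_columns:
                if re.search(rf"\b{re.escape(col)}\b", clause, re.IGNORECASE):
                    usage[col] += 1
    return usage

# Hypothetical queries, e.g. harvested from the Spark History Server.
queries = [
    "SELECT * FROM sales WHERE region = 'EU' AND order_date > '2025-01-01'",
    "SELECT customer_id FROM sales WHERE order_date = '2025-07-01'",
]
# order_date is filtered in both queries, region in one,
# customer_id only appears in a SELECT list and is not counted.
print(count_column_usage(queries, ["region", "order_date", "customer_id"]))
```

The resulting counts can then feed clustering-column selection and stats scheduling heuristics.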
Reality Check:
Targeted solutions (clustering hints, stats automation) are feasible
Full predictive optimization replication is complex
Databricks hasn't indicated plans to open-source it
Bottom Line: Build incrementally - start with query pattern analysis for liquid clustering decisions, then expand based on ROI.
07-21-2025 10:48 PM - edited 07-21-2025 10:55 PM
Hi @noorbasha534 ,
That's a really cool idea and definitely shows initiative - but realistically, it might not be worth the effort. There's a lot of engineering going on under the hood that would be tough to replicate in-house.
Collecting telemetry and using it for things like liquid clustering and stats gathering could work to some extent, but the effort required to build and maintain something similar would likely outweigh the benefits, especially given how deeply integrated and optimized the native solution is.
If you have external tables, I would just take care of regular table maintenance (e.g., running OPTIMIZE and VACUUM regularly).
Would be awesome if Databricks open-sourced it, though - totally agree with you there.
07-21-2025 11:53 PM
@szymon_dybczak since liquid clustering only allows up to 4 columns for now, I think I can just go with the primary keys here. In our case, we have wide tables with 300+ columns, and users query columns that are not in the first 32 positions (for which we gather stats), so the stats gathering is not really helping us.
07-22-2025 06:58 AM
Hi @noorbasha534,
If you use DBR 13.3+, you can specify columns for which you would like to collect statistics with delta.dataSkippingStatsColumns
https://learn.microsoft.com/en-us/azure/databricks/delta/data-skipping
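To illustrate: `delta.dataSkippingStatsColumns` takes a comma-separated column list, so you can point stats collection at the columns users actually filter on instead of the first 32 positions. The table and column names below are hypothetical; per the linked docs, existing statistics may need recomputing after changing the property.

```python
def stats_columns_ddl(table, columns):
    """Build the ALTER TABLE statement that restricts Delta statistics
    collection to an explicit column list via the documented
    delta.dataSkippingStatsColumns table property."""
    col_list = ",".join(columns)
    return (f"ALTER TABLE {table} SET TBLPROPERTIES "
            f"('delta.dataSkippingStatsColumns' = '{col_list}')")

print(stats_columns_ddl("sales", ["order_date", "region"]))
```

This pairs naturally with the telemetry idea earlier in the thread: feed the most-filtered columns into both clustering and stats configuration.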
07-22-2025 04:48 AM
@LinlinH thanks for the details. Can you please share any GitHub link where the community work is published, so I can check whether any code can be reused...