Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

In-house built predictive optimization

noorbasha534
Valued Contributor II

Hello all

Has anyone attempted to look at the internals of predictive optimization and build an in-house solution mimicking its functionality? I understand there are no plans from Databricks to roll out this feature for external tables, and hence we were thinking of gathering the telemetry of frequently used columns ourselves and using that information for liquid clustering and stats gathering....

On the other hand, if Databricks could open-source it, that would be really helpful...

5 REPLIES

lingareddy_Alva
Honored Contributor III

Hi @noorbasha534 

You're touching on a really interesting area! While Databricks hasn't open-sourced predictive optimization, there have been some community efforts and approaches to build similar functionality:

Community Efforts:
- Yes, some teams build DIY solutions using Spark query logs and custom listeners
- These focus on liquid clustering column selection and automated stats collection
- No full open-source clone exists yet

Common Approaches:
- Parse Spark History Server logs for column usage patterns
- Custom EventListeners to capture query telemetry
- Heuristic-based optimization scheduling
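The log-parsing approach above can be sketched in plain Python. This is a minimal, hypothetical sketch: the function name, sample plan strings, and column names are all invented for illustration; in practice the plan text would come from your own telemetry capture (e.g. `physicalPlanDescription` fields scraped from Spark event logs).

```python
import re
from collections import Counter

def tally_predicate_columns(plan_texts, known_columns):
    """Count how often each known table column appears in captured
    plan/predicate text. `plan_texts` is whatever your telemetry
    pipeline collects (event-log plan descriptions, query text, etc.)."""
    counts = Counter()
    for text in plan_texts:
        for col in known_columns:
            # word-boundary match so `id` does not also match `order_id`
            counts[col] += len(re.findall(rf"\b{re.escape(col)}\b", text))
    return counts

# Hypothetical captured plan fragments:
plans = [
    "Filter (customer_id = 42 AND region = 'EU')",
    "SortMergeJoin [customer_id], [customer_id]",
]
usage = tally_predicate_columns(plans, ["customer_id", "region", "status"])
# usage now ranks customer_id above region, with status unused
```

Real plan text is far messier than these fragments (aliases, expression IDs, nested plans), so a production version would need smarter parsing, but the counting idea is the same.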

Reality Check:
- Targeted solutions (clustering hints, stats automation) are feasible
- Full predictive optimization replication is complex
- Databricks hasn't indicated plans to open-source it

Bottom Line: Build incrementally - start with query pattern analysis for liquid clustering decisions, then expand based on ROI.
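To make "start with query pattern analysis for liquid clustering decisions" concrete, here is a hedged sketch: given per-column usage counts from your telemetry, pick the top candidates (liquid clustering currently allows up to 4 clustering columns, as noted elsewhere in this thread) and build the corresponding DDL string. The function name, table name, and counts are hypothetical.

```python
def pick_clustering_columns(usage_counts, max_cols=4):
    """Rank columns by how often they appear in filters/joins and
    return the top candidates for liquid clustering."""
    ranked = sorted(usage_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [col for col, n in ranked[:max_cols] if n > 0]

# Hypothetical usage counts gathered from query telemetry:
usage = {"customer_id": 120, "event_date": 95, "region": 40,
         "status": 8, "sku": 2}
cols = pick_clustering_columns(usage)
sql = f"ALTER TABLE my_table CLUSTER BY ({', '.join(cols)})"
# the statement would then be issued via spark.sql(sql)
```

In practice you would also want a change threshold (only re-cluster when the ranking shifts materially), since changing clustering columns has a rewrite cost.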

 

 


Hi @noorbasha534 ,

That's a really cool idea and definitely shows initiative - but realistically, it might not be worth the effort. There's a lot of engineering going on under the hood that would be tough to replicate in-house.

Collecting telemetry and using it for things like liquid clustering and stats gathering could work to some extent, but the effort required to build and maintain something similar would likely outweigh the benefits, especially given how deeply integrated and optimized the native solution is.
If you have external tables, I would just take care of regular maintenance of the tables (e.g. running OPTIMIZE / VACUUM regularly).
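That routine maintenance can be scripted in a few lines. A minimal sketch, assuming a hypothetical table name; 168 hours (7 days) is Delta's default VACUUM retention threshold.

```python
def maintenance_statements(table, vacuum_retention_hours=168):
    """Build the routine maintenance commands for a Delta table.
    The statements would be run on a schedule via spark.sql(...)."""
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {vacuum_retention_hours} HOURS",
    ]

stmts = maintenance_statements("catalog.schema.events")
# loop over your external tables and execute each statement in a
# scheduled job, e.g.: for s in stmts: spark.sql(s)
```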

Would be awesome if Databricks open-sourced it, though - totally agree with you there.

noorbasha534
Valued Contributor II

@szymon_dybczak since liquid clustering only allows 4 columns to be set for now, I think I can go blindly with the primary keys here. In our case, we have wide tables with 300+ columns, and users are querying on columns that are not in the first 32 positions for which we gather stats, so the stats gathering is not really helping us.

alsetr
New Contributor III

Hi @noorbasha534 
If you use DBR 13.3+, you can specify the columns for which you would like to collect statistics with the delta.dataSkippingStatsColumns table property:

https://learn.microsoft.com/en-us/azure/databricks/delta/data-skipping
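For the 300+ column case above, that property lets you point statistics collection at the columns users actually filter on instead of the default first 32. A sketch of building the DDL, with hypothetical table and column names; note that, as I understand it, the property applies to newly written files, so statistics for existing files are unchanged until they are recomputed or rewritten.

```python
def stats_columns_ddl(table, columns):
    """Build the ALTER TABLE statement that tells Delta which columns
    to collect file-level statistics for (DBR 13.3+). The property
    takes a comma-separated list of column names."""
    cols = ",".join(columns)
    return (f"ALTER TABLE {table} "
            f"SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = '{cols}')")

ddl = stats_columns_ddl("catalog.schema.wide_table",
                        ["customer_id", "event_date"])
# issue via spark.sql(ddl)
```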

noorbasha534
Valued Contributor II

@LinlinH thanks for the details. Can you please share any GitHub link where the community work is published, so I can check whether any code can be re-used...