<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic in-home built predictive optimization in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125902#M47571</link>
    <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;Has anyone looked at the internals of&amp;nbsp;predictive optimization and built an in-house solution that mimics its functionality? As I understand it, Databricks has no plans to roll this feature out for external tables, so we are thinking of gathering our own telemetry on frequently used columns and using it to drive liquid clustering and statistics collection.&lt;/P&gt;&lt;P&gt;Alternatively, if Databricks could open-source it, that would be really helpful.&lt;/P&gt;</description>
    <pubDate>Mon, 21 Jul 2025 21:10:50 GMT</pubDate>
    <dc:creator>noorbasha534</dc:creator>
    <dc:date>2025-07-21T21:10:50Z</dc:date>
    <item>
      <title>in-home built predictive optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125902#M47571</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;Has anyone looked at the internals of&amp;nbsp;predictive optimization and built an in-house solution that mimics its functionality? As I understand it, Databricks has no plans to roll this feature out for external tables, so we are thinking of gathering our own telemetry on frequently used columns and using it to drive liquid clustering and statistics collection.&lt;/P&gt;&lt;P&gt;Alternatively, if Databricks could open-source it, that would be really helpful.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Jul 2025 21:10:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125902#M47571</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-07-21T21:10:50Z</dc:date>
    </item>
    <item>
      <title>Re: in-home built predictive optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125906#M47575</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You're touching on a really interesting area! While Databricks hasn't open-sourced predictive optimization, there have been some community efforts and approaches to build similar functionality:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Community Efforts:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Yes, some teams build DIY solutions using Spark query logs and custom listeners&lt;BR /&gt;These focus on liquid clustering column selection and automated stats collection&lt;BR /&gt;No full open-source clone exists yet&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Common Approaches:&lt;/STRONG&gt;&lt;BR /&gt;Parse Spark History Server logs for column usage patterns&lt;BR /&gt;Write custom Spark listeners to capture query telemetry&lt;BR /&gt;Schedule optimization runs with heuristics&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reality Check:&lt;/STRONG&gt;&lt;BR /&gt;Targeted solutions (clustering hints, stats automation) are feasible&lt;BR /&gt;Full predictive optimization replication is complex&lt;BR /&gt;Databricks hasn't indicated plans to open-source it&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Bottom Line:&lt;/STRONG&gt; Build incrementally: start with query pattern analysis for liquid clustering decisions, then expand based on ROI.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Jul 2025 21:27:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125906#M47575</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-07-21T21:27:34Z</dc:date>
    </item>
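The "query pattern analysis" step suggested above could start as a simple frequency count. The snippet below is only a rough sketch: it assumes the filter columns of each query have already been extracted by some other means (query history, a listener, log parsing), and the function name, sample data, and default limit of 4 (the clustering column limit mentioned later in this thread) are illustrative assumptions, not anything Databricks provides.

```python
from collections import Counter

# Hypothetical telemetry: one list per query, holding the columns that
# appeared in its filter or join predicates. Extracting these lists is
# outside the scope of this sketch.
def top_filter_columns(queries, limit=4):
    """Return the most frequently filtered columns, most used first."""
    counts = Counter()
    for columns in queries:
        # Count each column once per query so a single query that repeats
        # a column cannot dominate the ranking.
        counts.update(set(columns))
    # Sort by descending count, then by name for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]

# Made-up sample data for illustration only.
sample = [
    ["customer_id", "order_date"],
    ["customer_id"],
    ["region", "order_date", "customer_id"],
    ["status"],
]
print(top_filter_columns(sample))
```

The ranked list is only a candidate set for clustering columns; cardinality and data layout would still need a manual look before changing any table.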
    <item>
      <title>Re: in-home built predictive optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125943#M47584</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;That’s a really cool idea and definitely shows initiative, but realistically it might not be worth the effort. There’s a lot of engineering going on under the hood that would be tough to replicate in-house.&lt;/P&gt;&lt;P&gt;Collecting telemetry and using it for things like liquid clustering and stats gathering could work to some extent, but the effort required to build and maintain something similar would likely outweigh the benefits, especially given how deeply integrated and optimized the native solution is.&lt;BR /&gt;If you have external tables, I would just take care of regular table maintenance (e.g. running OPTIMIZE and VACUUM regularly).&lt;/P&gt;&lt;P&gt;It would be awesome if Databricks open-sourced it, though. Totally agree with you there.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 05:55:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125943#M47584</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-07-22T05:55:08Z</dc:date>
    </item>
    <item>
      <title>Re: in-home built predictive optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125950#M47588</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;since liquid clustering only allows 4 columns to be set for now, I think I can go blindly with the primary keys here. In our case, we have wide tables with 300+ columns, and users query columns outside the first 32 positions, which are the only ones we gather stats for, so the stats gathering is not really helping us.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 06:53:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125950#M47588</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-07-22T06:53:43Z</dc:date>
    </item>
    <item>
      <title>Re: in-home built predictive optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125989#M47603</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/104707"&gt;@LinlinH&lt;/a&gt;&amp;nbsp;thanks for the details. Could you please share a GitHub link to where the community work is published, so I can check whether any of the code can be reused?&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 11:48:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125989#M47603</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-07-22T11:48:53Z</dc:date>
    </item>
    <item>
      <title>Re: in-home built predictive optimization</title>
      <link>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125999#M47608</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;,&lt;BR /&gt;If you use DBR&amp;nbsp;&lt;SPAN&gt;13.3+, you can specify the columns for which you would like to collect statistics with the&amp;nbsp;delta.dataSkippingStatsColumns table property.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/delta/data-skipping" target="_blank" rel="noopener"&gt;https://learn.microsoft.com/en-us/azure/databricks/delta/data-skipping&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 13:58:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/in-home-built-predictive-optimization/m-p/125999#M47608</guid>
      <dc:creator>alsetr</dc:creator>
      <dc:date>2025-07-22T13:58:23Z</dc:date>
    </item>
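For wide tables like the ones described earlier in the thread, the delta.dataSkippingStatsColumns suggestion might be applied with a small helper that builds the ALTER TABLE statement. This is a minimal sketch: the helper name, table name, and column names are hypothetical, and it only builds the SQL string, so running it (for example with spark.sql on a cluster) is left to the reader.

```python
def stats_columns_statement(table, columns):
    """Build an ALTER TABLE statement that sets delta.dataSkippingStatsColumns."""
    if not columns:
        raise ValueError("at least one column is required")
    # The property takes a comma-separated list of column names.
    column_list = ",".join(columns)
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = '{column_list}')"
    )

# Hypothetical table and columns, for illustration only.
statement = stats_columns_statement(
    "main.sales.orders", ["customer_id", "order_date"]
)
print(statement)
```

Statistics for files already written may need to be recomputed separately after changing the property; the linked data-skipping documentation covers the details.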
  </channel>
</rss>

