I'm prototyping a cluster cost / right-sizing advisor and want a reality check from people running Databricks at real scale before I sink more time into it.
The main thing I'm chasing is Photon fallback. Photon quietly drops to the JVM on unsupported ops (Python UDFs, some struct/array predicates, a few Delta features), so you keep paying the Photon DBU premium while getting JVM speed, and as far as I can tell it's basically invisible in the UI. Alongside that, the usual right-sizing stuff: over-provisioned workers/driver, idle clusters.
Where I've got to so far:
In-cluster collection: a bundled JVM QueryExecutionListener reads executedPlan and flags only real fallback (mid-plan ColumnarToRow, RowToColumnar round-trips, BatchEvalPython/ArrowEvalPython), ignoring the benign terminal ColumnarToRow that every query ends on. A SparkListener grabs the executor curve and stage/task timing. The collector self-arms at Python interpreter startup via a .pth file.
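To make the fallback rule concrete, here is a rough Python sketch of the classification only. It is an illustration, not the actual listener (which runs JVM-side against executedPlan). `PlanNode`, `WRAPPERS`, and the Photon operator names in the usage example are hypothetical stand-ins, and treating every RowToColumnar as a fallback is a simplification.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Operators that signal Python evaluation outside Photon.
PYTHON_EVAL = {"BatchEvalPython", "ArrowEvalPython"}
# Assumption: wrapper nodes that may sit above the final ColumnarToRow.
# Real plans may need more entries, depending on runtime version.
WRAPPERS = {"AdaptiveSparkPlan"}


@dataclass
class PlanNode:
    """Hypothetical stand-in for a physical plan node (root = last to run)."""
    name: str
    children: list[PlanNode] = field(default_factory=list)


def find_fallbacks(root: PlanNode) -> list[str]:
    """Return operator names that indicate a real fallback, in walk order."""
    flagged: list[str] = []

    def walk(node: PlanNode, only_wrappers_above: bool) -> None:
        if node.name in PYTHON_EVAL or node.name == "RowToColumnar":
            flagged.append(node.name)
        elif node.name == "ColumnarToRow" and not only_wrappers_above:
            # A ColumnarToRow with real operators above it is mid-plan.
            flagged.append(node.name)
        # Children stay "terminal" only while every ancestor is a wrapper.
        still_terminal = only_wrappers_above and node.name in WRAPPERS
        for child in node.children:
            walk(child, still_terminal)

    walk(root, True)
    return flagged


def chain(*names: str) -> PlanNode:
    """Build a single-child plan from root to leaf, for quick experiments."""
    node = None
    for name in reversed(names):
        node = PlanNode(name, [node] if node else [])
    return node
```

For example, `find_fallbacks(chain("AdaptiveSparkPlan", "ColumnarToRow", "PhotonProject", "PhotonScan"))` returns an empty list, because the only ColumnarToRow is the terminal one. Putting a `BatchEvalPython` between two transitions further down the chain flags all three nodes.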
System tables for ground truth: system.compute.node_timeline (CPU/mem, P95), system.compute.clusters (config/autoscale), and system.billing.usage × system.billing.list_prices (billed cost).
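The usage × list_prices step can be sketched in plain Python to show the price-window logic. This is a minimal illustration under assumptions: rows are already fetched into the hypothetical `Usage` and `Price` records, `unit_price` is assumed to be flattened from `pricing.default`, the price window is treated as half-open, and matching on `usage_start_time` (rather than `usage_end_time`) is a choice, not a confirmed rule.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Usage:
    cluster_id: Optional[str]      # from usage_metadata.cluster_id; may be missing
    sku_name: str
    usage_start_time: datetime
    usage_quantity: float          # DBUs


@dataclass
class Price:
    sku_name: str
    price_start_time: datetime
    price_end_time: Optional[datetime]   # None means the price is still in effect
    unit_price: float                    # assumed flattened from pricing.default


def billed_cost_by_cluster(usage: list[Usage], prices: list[Price]) -> dict[str, float]:
    """Sum usage_quantity * unit_price per cluster, matching each usage row
    to the price whose [start, end) window contains usage_start_time."""
    totals: dict[str, float] = {}
    for u in usage:
        if u.cluster_id is None:
            continue  # unattributable rows (e.g. no cluster id) are skipped here
        price = next(
            (
                p for p in prices
                if p.sku_name == u.sku_name
                and p.price_start_time <= u.usage_start_time
                and (p.price_end_time is None or u.usage_start_time < p.price_end_time)
            ),
            None,
        )
        if price is None:
            continue  # no price window covers this row
        totals[u.cluster_id] = totals.get(u.cluster_id, 0.0) + u.usage_quantity * price.unit_price
    return totals
```

Rows that are skipped (missing cluster id or no matching price window) are worth counting separately, since they are exactly where billed and modeled cost would diverge.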
Engine: classify each cluster as FIXED or AUTOSCALE, then run a guard step (no downsize suggestion if peak CPU/mem is high, there's memory spill, or the evidence is thin), then cost (billed when I have the grant, else modeled), and report runtime impact as a bounded range rather than a single number.
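A minimal Python sketch of the guard step and the bounded runtime range, for illustration only. The thresholds (`MAX_PEAK_PCT`, `MIN_OBSERVED_DAYS`), the `Evidence` record, and the range model are all assumptions: the range simply brackets the slowdown between "no effect" and "perfectly parallel work", which is one possible way to bound it and not necessarily what the engine does.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds; real values would need tuning per workload.
MAX_PEAK_PCT = 80.0
MIN_OBSERVED_DAYS = 14


@dataclass
class Evidence:
    peak_cpu_pct: float
    peak_mem_pct: float
    spill_bytes: int
    observed_days: int


def downsize_blockers(ev: Evidence) -> list[str]:
    """Return the reasons a downsize should NOT be suggested (empty = allowed)."""
    blockers: list[str] = []
    if ev.peak_cpu_pct >= MAX_PEAK_PCT:
        blockers.append("peak CPU high")
    if ev.peak_mem_pct >= MAX_PEAK_PCT:
        blockers.append("peak memory high")
    if ev.spill_bytes > 0:
        blockers.append("memory spill observed")
    if ev.observed_days < MIN_OBSERVED_DAYS:
        blockers.append("evidence too thin")
    return blockers


def runtime_multiplier_range(current_workers: int, proposed_workers: int) -> tuple[float, float]:
    """Bound the runtime multiplier after a downsize.

    Lower bound 1.0: the job was not limited by worker count.
    Upper bound current/proposed: the job scales perfectly with workers.
    """
    if not 0 < proposed_workers <= current_workers:
        raise ValueError("expected 0 < proposed_workers <= current_workers")
    return (1.0, current_workers / proposed_workers)
```

Returning the list of blockers instead of a bare boolean keeps the "why not" visible in the recommendation, which tends to matter more to reviewers than the suggestion itself.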
The stuff I'm actually stuck on:
- Shared & serverless seal the JVM (Spark Connect), so the listener can't attach and I get nothing for Photon on standard/shared access mode. Has anyone found a supported way to see Photon fallback there (query profile API, system.query.history, something on the roadmap), or is in-process really the only path today?
- Rolling the listener out across a fleet — what's the least-intrusive pattern your platform team would actually sign off on? Cluster policy + allowlisted library, a global init script, spark.sql.queryExecutionListeners via policy (and does that even register on shared mode for you)?
- Observation window for right-sizing — to avoid recommending a downsize right before a weekend/month-end batch, what do you trust: a fixed N-day window that's guaranteed to cover the business cycle, or a peak-based sample gate? Curious what's held up in practice.
- system.billing per-cluster attribution — any gotchas joining usage → list_prices (price-window edges, how complete usage_metadata.cluster_id is, serverless SKUs)?
And the two I most want opinions on:
- If you're already fighting cluster cost, what's the part that's still annoying and unsolved? (idle detection, autoscaling tuning, ephemeral job-cluster sprawl, Photon ROI, spot/driver sizing, whatever it is.)
- Does this already overlap something that does it well (system-tables dashboards, Overwatch, a third-party FinOps tool)? And if so, where's the gap that's still worth filling?
Not selling anything, just trying to work out whether I'm reinventing a wheel or if there's a real gap here. Happy to be told it's the former.
Data Engineer | Apache Spark | Delta Lake | Databricks