Hi all, how are you doing today?
I wanted to share something interesting from my recent Databricks work. I’ve been playing around with an idea I call “Real-Time Metadata Intelligence.” Most of us focus on optimizing data pipelines, query performance, or cluster costs, but very few of us look closely at the metadata our systems already produce: the job runs, the schema changes, the partition growth, the query plans, and all the other little hints that tell us how healthy or tired our pipelines are.
So here’s what I did. I started capturing Unity Catalog events, Delta logs, and job run metadata into a simple Delta table. Then I used a small PySpark streaming job to keep that metadata table live and constantly updated. On top of it, I built a basic ML model that learns from the history: it looks at trends like table size growth, file counts, skew ratios, and processing times. The goal was to predict when a job might start slowing down, before it actually fails or breaches its SLA.
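To make that a bit more concrete, here’s a stripped-down sketch of the capture and feature steps, not my exact code. The table names (main.sales.orders, ops.meta.table_snapshots) are placeholders, and this version polls DESCRIBE DETAIL and the Delta history on a schedule instead of streaming, just to show the shape of the data:

```python
# Simplified sketch: capture per-table metadata snapshots into a Delta table,
# then derive growth features from them. All table names are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

MONITORED_TABLES = ["main.sales.orders", "main.sales.events"]  # placeholders
SNAPSHOT_TABLE = "ops.meta.table_snapshots"                    # placeholder target

def capture_table_snapshots(spark: SparkSession) -> None:
    """Append one row per monitored table with size / file-count metadata."""
    rows = []
    for name in MONITORED_TABLES:
        detail = spark.sql(f"DESCRIBE DETAIL {name}").first()
        last_commit = DeltaTable.forName(spark, name).history(1).first()
        rows.append((name, detail["sizeInBytes"], detail["numFiles"],
                     last_commit["operation"]))
    snapshot = (spark.createDataFrame(
                    rows,
                    "table_name string, size_bytes long, "
                    "num_files long, last_operation string")
                .withColumn("captured_at", F.current_timestamp()))
    snapshot.write.format("delta").mode("append").saveAsTable(SNAPSHOT_TABLE)

def growth_features(spark: SparkSession):
    """Turn raw snapshots into per-table growth trends for the model."""
    w = Window.partitionBy("table_name").orderBy("captured_at")
    return (spark.read.table(SNAPSHOT_TABLE)
            .withColumn("prev_size", F.lag("size_bytes").over(w))
            .withColumn("size_growth", F.col("size_bytes") - F.col("prev_size"))
            .withColumn("prev_files", F.lag("num_files").over(w))
            .withColumn("file_growth", F.col("num_files") - F.col("prev_files"))
            .na.drop(subset=["prev_size"]))
```

In my actual setup, a small Structured Streaming job keeps the snapshot table current and a lightweight model replaces the manual diffing, but the features (size growth, file counts, skew, processing times) are the same idea.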
The early results were surprisingly good. The model caught changes in partition size hours before the regular Databricks job alert kicked in. It almost felt like the platform was teaching itself about its own performance patterns. That made me think — what if every Databricks workspace had its own “metadata brain”? Something that continuously learns from all table updates, queries, and workflows to alert engineers proactively instead of reactively.
I’m curious — has anyone here tried something similar? Maybe building a self-learning layer around Unity Catalog or Delta Lake metadata? Do you think Databricks should have something like this built in — a “Smart Pipeline Monitor” that reads from its own logs and suggests optimizations in real time?
Would love to hear your thoughts. I truly believe metadata is one of the most underused treasures in our ecosystem; it’s literally free intelligence waiting to be used. Let’s see what we can build with it!