Databricks Advent Calendar 2025 #20
As Unity Catalog becomes an enterprise catalog, bring-your-own lineage is one of my favorite features.
In 2025, Metrics Views are becoming the standard way to define business logic once and reuse it everywhere. Instead of repeating complex SQL, teams can work with clean, consistent metrics.
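A minimal sketch of defining and querying a metric view, assuming the YAML-based syntax (catalog, table, and column names are made up):

```python
# Define the metric once...
spark.sql("""
    CREATE VIEW main.analytics.sales_metrics
    WITH METRICS
    LANGUAGE YAML
    AS $$
    version: 0.1
    source: main.analytics.orders
    dimensions:
      - name: order_date
        expr: order_date
    measures:
      - name: total_revenue
        expr: SUM(amount)
    $$
""")

# ...and reuse it everywhere with MEASURE().
spark.sql("""
    SELECT order_date, MEASURE(total_revenue)
    FROM main.analytics.sales_metrics
    GROUP BY order_date
""").show()
```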
Automatic file retention in Auto Loader is one of my favourite new features of 2025. It can automatically move processed cloud files to cold storage or simply delete them.
Thanks for sharing @Hubert-Dudek! That's a really great feature. It simplified a lot of the data maintenance process at one of my clients.
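A minimal sketch of how the Auto Loader retention options look in practice, as I remember them from the docs; option names, paths, and the retention period are assumptions to verify:

```python
# Auto Loader stream that archives fully processed source files.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # MOVE relocates processed files; DELETE would remove them instead
      .option("cloudFiles.cleanSource", "MOVE")
      .option("cloudFiles.cleanSource.moveDestination", "s3://my-bucket/archive/")
      # files become eligible for cleanup once older than this
      .option("cloudFiles.cleanSource.retentionDuration", "7 days")
      .load("s3://my-bucket/landing/"))
```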
Replacing all records for a given date with newly arriving data for that date is a typical design pattern. Now, thanks to the simple REPLACE USING in Databricks, it is easier than ever!
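Roughly like this, with made-up table names (syntax as I recall it, so double-check the docs):

```python
# Replace every row whose order_date appears in the incoming batch,
# then insert the new rows - a single atomic statement.
spark.sql("""
    INSERT INTO main.sales.daily_orders
    REPLACE USING (order_date)
    SELECT * FROM main.staging.new_orders
""")
```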
Real-time mode is a breakthrough that lets Spark utilize all available CPUs to process records with single-millisecond latency, while decoupling checkpointing from per-record processing.
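A rough sketch of what a real-time query could look like; the enablement flag is an assumption on my part, so check the real-time mode docs before relying on it:

```python
# Assumed config flag - verify the exact name in the real-time mode docs.
spark.conf.set("spark.databricks.streaming.realTimeMode.enabled", "true")

# Hypothetical Kafka source and Delta sink; broker, paths, and names are made up.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())

query = (events.writeStream
         .option("checkpointLocation", "/Volumes/main/chk/clickstream")
         .toTable("main.web.clickstream_rt"))
```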
For many data engineers who love PySpark, the most significant improvement of 2025 was the addition of merge to the DataFrame API, so no more Delta library or SQL is needed to perform MERGE. P.S. I still prefer SQL MERGE inside spark.sql().
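A minimal sketch of the new mergeInto writer; table and column names are made up, and the way the condition references source and target is my best recollection:

```python
from pyspark.sql.functions import expr

updates_df = spark.table("main.staging.customer_updates")  # hypothetical source

(updates_df.alias("updates")
    .mergeInto("main.sales.customers", expr("updates.id = customers.id"))
    .whenMatched().updateAll()      # update existing customers
    .whenNotMatched().insertAll()   # insert brand-new ones
    .merge())                       # nothing runs until merge() is called
```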
The new Lakebase experience is a game-changer for transactional databases. That functionality is fantastic, and autoscaling to zero makes it really cost-effective. Do you need to deploy to prod? Just branch the production database to the release branch, an...
Ingestion from SharePoint is now available directly in PySpark. Just define a connection and use spark.read or, even better, spark.readStream with Auto Loader, then specify the file type and the options for that file (PDF, CSV, Excel, etc.).
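Something along these lines; the connection option name and the site URL are assumptions for illustration, not verified syntax:

```python
# Hypothetical SharePoint ingestion via Auto Loader.
# "databricks.connection" is a placeholder - check the docs for the exact
# option that binds a Unity Catalog connection to the stream.
docs = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "pdf")                  # file type, as the post describes
        .option("databricks.connection", "sharepoint_conn")  # assumed option name
        .load("https://contoso.sharepoint.com/sites/finance/Shared%20Documents/"))
```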
Excel: the big news this week is native importing of Excel files. Write operations are also possible, and you can choose a specific data range. It also works with the streaming Auto Loader, currently in beta. GPT 5.2: the same day...
ZeroBus changes the game: you can now push event data directly into Databricks, even from on-prem. No extra event layer needed. Every Unity Catalog table can act as an endpoint.
Databricks goes native on Excel. You can now ingest and query .xls/.xlsx directly in Databricks (SQL + PySpark, batch and streaming), with auto schema/type inference, sheet and cell-range targeting, and evaluated formulas. No extra libraries needed anymore.
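For example, something like this; the format name matches the announcement, but the sheet and cell-range option names are assumptions on my part:

```python
# Native Excel read - option names for sheet and range targeting are
# placeholders; verify them in the release notes.
df = (spark.read
      .format("excel")
      .option("sheetName", "Q4")          # assumed option: pick one sheet
      .option("dataAddress", "A1:F100")   # assumed option: restrict the cell range
      .load("/Volumes/main/raw/files/report.xlsx"))
```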
Tags, whether assigned manually or automatically by the data classification service, can be protected using policies. Column masking can automatically mask columns with a given tag for all users except those with elevated access.
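The building block underneath is a standard Unity Catalog column mask; a minimal example with made-up names (the tag-driven policy layer then applies such masks automatically):

```python
# A masking function: members of a privileged group see the raw value,
# everyone else gets a redacted placeholder.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***REDACTED***'
    END
""")

# Bind the mask to a column.
spark.sql("""
    ALTER TABLE main.crm.customers
    ALTER COLUMN email SET MASK main.governance.mask_email
""")
```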
Data classification automatically tags Unity Catalog tables and is now available in system tables as well.
Imagine all a data engineer or analyst needs to do to read from a REST API is use spark.read() - no direct request calls, no manual JSON parsing, just spark.read. That's the power of a custom Spark Data Source. Soon, we will see a surge of open-source...
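This is the PySpark 4 Python Data Source API; a minimal sketch with a made-up format name and a hard-coded payload instead of real HTTP calls:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class RestApiDataSource(DataSource):
    """Skeleton of a custom source; 'restapi' is a made-up format name."""

    @classmethod
    def name(cls):
        return "restapi"

    def schema(self):
        return "id INT, name STRING"

    def reader(self, schema):
        return RestApiReader(self.options)

class RestApiReader(DataSourceReader):
    def __init__(self, options):
        self.options = options

    def read(self, partition):
        # A real reader would call the API at self.options["url"] and yield
        # one tuple per record; static rows keep this sketch self-contained.
        yield (1, "alice")
        yield (2, "bob")

# Register once per session, then it behaves like any built-in format.
spark.dataSource.register(RestApiDataSource)
df = spark.read.format("restapi").option("url", "https://example.com/api").load()
```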
DQX is one of the most crucial Databricks Labs projects this year, and we can expect more and more great checks from it to be supported natively in Databricks. More about DQX at https://databrickslabs.github.io/dqx/
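From memory, applying DQX checks looks roughly like this; the exact API and check schema may differ, so treat it as a sketch and consult the docs linked above:

```python
# Rough DQX usage sketch - verify class names and the check format in the docs.
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

checks = [
    {"criticality": "error",
     "check": {"function": "is_not_null", "arguments": {"column": "id"}}},
]

input_df = spark.table("main.sales.orders")  # made-up table
# Split valid rows from quarantined ones based on the checks.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```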