Subscribe to the NextGenLakehouse newsletter to receive monthly updates!
In a nutshell
The open data/table format war is over: with the Tabular.io acquisition, Delta and Iceberg will work together toward unification
Open governance is at hand now that Unity Catalog is open source
Data is becoming more accessible than ever, with business users able to get insights directly and analysts able to build powerful pipelines, all assisted by AI.
Compound AI is the future; the single monolithic large language model is a niche!
Unity Catalog is now Open Source
Under the hood, open Lakehouses are based on open formats (Delta + Iceberg with UniForm; more detail on that in our next post). While this simplifies your data layer and doesn’t require external systems (such as the Hive Metastore) to understand your data layout, open formats alone don’t meet modern requirements around security, ACLs, discoverability, and interoperability.
To solve these challenges, Databricks provides Unity Catalog, a unified Data+AI governance layer. However, with each vendor implementing its own catalog, the ecosystem is fragmented and not open, making it harder to build and deploy interoperable systems.
To solve this challenge and accelerate the entire ecosystem, Databricks released unitycatalog.io, an OSS catalog implementation for Data + AI. In a nutshell, Unity Catalog (Apache 2.0 licensed, a Linux Foundation project) provides:
Multi-engine: you can write your table with Spark and it will appear when you LIST the catalog from another engine such as DuckDB
Data+AI: UC provides a single namespace for organizing and sharing tables, but also unstructured data and AI assets. This means you can LIST your tables, but also your files or models, from any system supporting UC.
Iceberg REST catalog support: the first release is already compatible with the Iceberg REST catalog API for accessing your tables
Credential vending to gate clients’ access to the underlying cloud storage for tables and volumes
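To make the "single namespace" idea concrete, here is a minimal sketch of how a client might address the OSS server's REST API. The base URL and query parameters below are assumptions based on the server's local quickstart defaults, not a definitive API reference:

```python
from urllib.parse import urlencode

# Assumed base URL: the OSS server runs locally on port 8080 in its quickstart.
BASE = "http://localhost:8080/api/2.1/unity-catalog"

def list_url(asset: str, catalog: str, schema: str) -> str:
    """Build the REST URL to list assets (tables, volumes, models, ...)
    under a given catalog.schema namespace."""
    return f"{BASE}/{asset}?" + urlencode({"catalog_name": catalog, "schema_name": schema})

# The same namespace shape addresses tables, files (volumes), and AI assets alike:
print(list_url("tables", "unity", "default"))
# http://localhost:8080/api/2.1/unity-catalog/tables?catalog_name=unity&schema_name=default
```

Because every engine talks to the same catalog endpoint, a table written from Spark shows up in the same LIST call issued from DuckDB or any other UC-aware client.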
Attribute-Based Access Control (ABAC) on Databricks UC!
One of my favorite features. Databricks now provides a very easy way to define policies based on tags. Just add a tag on any column (e.g., pii) and the tagged columns/rows will automatically be masked/filtered. This is super easy to set up, and will be available soon!
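The idea behind tag-driven policies can be pictured in a few lines of Python. This is a toy illustration of the concept, not Databricks’ implementation: the policy attaches to the tag, so any newly tagged column is covered automatically.

```python
# Toy ABAC sketch: policies are keyed by tag, not by column name,
# so tagging a new column automatically brings it under the policy.
MASK_POLICIES = {"pii": lambda value: "***"}  # mask any column tagged "pii"

def apply_policies(row: dict, column_tags: dict) -> dict:
    """Return a copy of the row with every tagged column masked."""
    masked = {}
    for col, value in row.items():
        tags = column_tags.get(col, [])
        policy = next((MASK_POLICIES[t] for t in tags if t in MASK_POLICIES), None)
        masked[col] = policy(value) if policy else value
    return masked

tags = {"email": ["pii"], "revenue": []}
print(apply_policies({"email": "a@b.com", "revenue": 42}, tags))
# {'email': '***', 'revenue': 42}
```

The payoff is operational: you define the policy once per tag instead of once per column, and governance keeps up as the schema grows.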
Metrics added to Databricks UC!
UC will now support metrics. You can think of a metric as a function computing some business outcome: what are my revenues, what is my churn, what defines the EMEA region, etc. These definitions differ for each business. Defining them within your catalog makes it easier to standardize your organization and make sure everyone uses the same definition.
Furthermore, because the Data Intelligence Platform analyzes and understands these metrics, the engine can generate better, certified answers, including through AI/BI capabilities and Genie spaces. If a business user asks about churn in EMEA, the engine will know what this means for your business and generate a proper SQL query to answer accordingly!
Metrics are like a user manual for your platform to understand your data!
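One way to picture a metric store is as a shared registry mapping a business term to one certified definition. This is a toy sketch of the concept only; in the actual platform, metric definitions live in UC itself, and the table and column names here are hypothetical:

```python
# Toy metric registry: one certified SQL definition per business term,
# shared by every consumer (dashboards, notebooks, natural-language queries).
METRICS = {
    "churn": "COUNT(DISTINCT CASE WHEN cancelled THEN customer_id END) / COUNT(DISTINCT customer_id)",
    "revenue": "SUM(amount)",
}

def metric_query(metric: str, table: str, where: str = "") -> str:
    """Expand a metric name into a full SQL query using its certified definition."""
    expr = METRICS[metric]  # unknown metrics fail loudly instead of guessing
    clause = f" WHERE {where}" if where else ""
    return f"SELECT {expr} AS {metric} FROM {table}{clause}"

# A question like "churn in EMEA" resolves to the same SQL for everyone:
print(metric_query("churn", "subscriptions", "region = 'EMEA'"))
```

Because every consumer expands the same definition, two dashboards (or a business user and an AI assistant) can no longer disagree about what "churn" means.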
Databricks Clean Rooms will soon be in public preview!
Clean rooms make it easy for companies to collaborate on data while not directly sharing the underlying data. This provides a safe, cross-cloud, cross-data platform environment to collaborate on any data while enforcing privacy.
As with Databricks sharing in general, Clean Rooms let you share data, notebooks, code, and AI models!
Delta Sharing for query federation!
You’ll be able to share your data with any other system leveraging the Delta Sharing OSS protocol. This will make it easier to build and interact with open systems supporting the Delta Sharing protocol!
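With the OSS delta-sharing Python client, a recipient addresses a shared table as `<profile-file>#<share>.<schema>.<table>`. A minimal sketch follows; the profile path and share/schema/table names are hypothetical placeholders:

```python
def table_coordinate(profile: str, share: str, schema: str, table: str) -> str:
    """Build a delta-sharing table URL: profile file + '#' + share.schema.table."""
    return f"{profile}#{share}.{schema}.{table}"

url = table_coordinate("config.share", "sales_share", "default", "orders")
print(url)  # config.share#sales_share.default.orders

# With the delta-sharing package installed and a provider-issued profile file:
# import delta_sharing
# df = delta_sharing.load_as_pandas(url)  # reads the shared table, no copy needed
```

The profile file is issued by the data provider and holds the endpoint and bearer token, so the recipient needs no account on the provider's platform.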