Subscribe to the NextGenLakehouse newsletter to receive monthly updates!
In a nutshell
The open data/table format war is over: with the Tabular.io acquisition, Delta and Iceberg will work together toward unification
Open governance is at hand now that Unity Catalog is open source
Data is becoming more accessible than ever, with business users able to get insights directly and analysts able to build powerful pipelines, all assisted by AI.
Compound AI is the future; the single monolithic large language model is a niche!
Unity Catalog is now Open Source
Under the hood, open Lakehouses are based on open formats (Delta + Iceberg with UniForm; more detail on that in our next post). While this simplifies your data layer and doesn’t require external systems (such as the Hive Metastore) to understand your data layout, open formats alone don’t meet modern requirements around security, ACLs, discoverability, and interoperability.
To solve these challenges, Databricks provides Unity Catalog, a unified Data+AI governance layer. However, with each vendor implementing its own catalog, the ecosystem is fragmented and not open, making it harder to build and deploy interoperable systems.
To solve this challenge and accelerate the entire ecosystem, Databricks released unitycatalog.io, an OSS catalog implementation for Data + AI. In a nutshell, Unity Catalog (Apache 2.0 licensed, a Linux Foundation project) provides:
Multi-engine: you can write your table with Spark and it will appear when you LIST the catalog from another engine such as DuckDB
Data+AI: UC provides a single namespace for organizing and sharing tables, but also unstructured data and AI assets. This means you can LIST your tables, but also your files or models, from any system supporting UC.
Iceberg REST catalog support: the first release is already compatible with the Iceberg REST catalog API for accessing your tables
Credential vending to gate clients’ access to the underlying cloud storage for tables and volumes
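To make the "single namespace" idea concrete, here is a minimal sketch of how a client might address the OSS server's REST API. The base URL and query parameters below are assumptions based on the server's local quickstart defaults, not a definitive API reference:

```python
from urllib.parse import urlencode

# Assumed base URL: the OSS server runs locally on port 8080 in its quickstart.
BASE = "http://localhost:8080/api/2.1/unity-catalog"

def list_url(asset: str, catalog: str, schema: str) -> str:
    """Build the REST URL to list assets (tables, volumes, models, ...)
    under a given catalog.schema namespace."""
    return f"{BASE}/{asset}?" + urlencode({"catalog_name": catalog, "schema_name": schema})

# The same namespace shape addresses tables, files (volumes), and AI assets alike:
print(list_url("tables", "unity", "default"))
# http://localhost:8080/api/2.1/unity-catalog/tables?catalog_name=unity&schema_name=default
```

Because every engine talks to the same catalog endpoint, a table written from Spark shows up in the same LIST call issued from DuckDB or any other UC-aware client.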
Attribute-Based Access Control (ABAC) on Databricks UC!
One of my favorite features. Databricks now provides a very easy way to define policies based on tags. Just add a tag on any column (e.g., pii) and the tagged columns/rows will automatically be masked/filtered. This is super easy to set up, and will be available soon!
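The idea behind tag-driven policies can be pictured in a few lines of Python. This is a toy illustration of the concept, not Databricks’ implementation: the policy attaches to the tag, so any newly tagged column is covered automatically.

```python
# Toy ABAC sketch: policies are keyed by tag, not by column name,
# so tagging a new column automatically brings it under the policy.
MASK_POLICIES = {"pii": lambda value: "***"}  # mask any column tagged "pii"

def apply_policies(row: dict, column_tags: dict) -> dict:
    """Return a copy of the row with every tagged column masked."""
    masked = {}
    for col, value in row.items():
        tags = column_tags.get(col, [])
        policy = next((MASK_POLICIES[t] for t in tags if t in MASK_POLICIES), None)
        masked[col] = policy(value) if policy else value
    return masked

tags = {"email": ["pii"], "revenue": []}
print(apply_policies({"email": "a@b.com", "revenue": 42}, tags))
# {'email': '***', 'revenue': 42}
```

The payoff is operational: you define the policy once per tag instead of once per column, and governance keeps up as the schema grows.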
Metrics added to Databricks UC!
UC will now support metrics. You can think of a metric as a function computing some business outcome: what are my revenues, what is my churn, what defines the EMEA region, etc. These definitions differ for each business. Defining them within your catalog makes it easier to standardize your organization and make sure everyone uses the same definition.
Furthermore, because the Data Intelligence Platform analyzes and understands these metrics, the engine can generate better, certified answers, including through AI/BI capabilities and Genie spaces. If a business user asks about churn in EMEA, the engine will know what this means for your business and generate a proper SQL query to answer accordingly!
Metrics are like a user manual for your platform to understand your data!
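One way to picture a metric store is as a shared registry mapping a business term to one certified definition. This is a toy sketch of the concept only; in the actual platform, metric definitions live in UC itself, and the table and column names here are hypothetical:

```python
# Toy metric registry: one certified SQL definition per business term,
# shared by every consumer (dashboards, notebooks, natural-language queries).
METRICS = {
    "churn": "COUNT(DISTINCT CASE WHEN cancelled THEN customer_id END) / COUNT(DISTINCT customer_id)",
    "revenue": "SUM(amount)",
}

def metric_query(metric: str, table: str, where: str = "") -> str:
    """Expand a metric name into a full SQL query using its certified definition."""
    expr = METRICS[metric]  # unknown metrics fail loudly instead of guessing
    clause = f" WHERE {where}" if where else ""
    return f"SELECT {expr} AS {metric} FROM {table}{clause}"

# A question like "churn in EMEA" resolves to the same SQL for everyone:
print(metric_query("churn", "subscriptions", "region = 'EMEA'"))
```

Because every consumer expands the same definition, two dashboards (or a business user and an AI assistant) can no longer disagree about what "churn" means.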
Databricks Clean Rooms will soon be in public preview!
Clean rooms make it easy for companies to collaborate on data while not directly sharing the underlying data. This provides a safe, cross-cloud, cross-data platform environment to collaborate on any data while enforcing privacy.
As with Databricks sharing in general, Clean Rooms let you share data, notebooks, code, and AI models!
Delta Sharing for query federation!
You’ll be able to share your data with any other system leveraging the Delta Sharing OSS protocol. This will make it easier to build and interact with open systems supporting the Delta Sharing protocol!
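With the OSS delta-sharing Python client, a recipient addresses a shared table as `<profile-file>#<share>.<schema>.<table>`. A minimal sketch follows; the profile path and share/schema/table names are hypothetical placeholders:

```python
def table_coordinate(profile: str, share: str, schema: str, table: str) -> str:
    """Build a delta-sharing table URL: profile file + '#' + share.schema.table."""
    return f"{profile}#{share}.{schema}.{table}"

url = table_coordinate("config.share", "sales_share", "default", "orders")
print(url)  # config.share#sales_share.default.orders

# With the delta-sharing package installed and a provider-issued profile file:
# import delta_sharing
# df = delta_sharing.load_as_pandas(url)  # reads the shared table, no copy needed
```

The profile file is issued by the data provider and holds the endpoint and bearer token, so the recipient needs no account on the provider's platform.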