Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
dfrozo
Databricks Employee

Unity Catalog is the foundation of all data governance-related aspects of the Databricks Data and Intelligence platform. It provides a unified, centralized solution for key enterprise capabilities, including data discoverability, data lineage, auditing, data classification, data privacy, and access control. 

At Databricks, we recognize the value of our Technology Partners, or Independent Software Vendors (ISVs), who help us extend the Lakehouse's value to customers. ISVs broaden the scope of enterprise-wide governance by providing valuable integrations with Unity Catalog.

In this blog, we define each data governance domain, explain its importance, and outline the features available in Unity Catalog. We also provide examples of ISV integrations with Unity Catalog mapped to key data governance capabilities, and explore how these integrations relate to the six data governance domains illustrated below.

dfrozo_0-1729759042704.png

Data Discoverability

Data discoverability refers to the processes and techniques used to identify, understand, and document data assets within an organization. With the emergence of modern data assets like dashboards, machine learning models, queries, libraries, and notebooks, it has become critical for businesses to discover their assets robustly. 

By making data assets easily accessible and searchable, data discoverability reduces the time and effort required to locate and retrieve data, thus improving operational efficiency and reducing cost. Furthermore, robust data discoverability facilitates data-driven decision-making that provides a competitive advantage.

In Databricks, the Unity Catalog Catalog Explorer provides a way to discover data and AI assets, including notebooks, dashboards, and models. Users can leverage AI-powered personalized search across all assets. Catalog Explorer also allows users to view metadata, such as comments, data lineage, and data insights, which helps them understand the context and usage of data assets. The illustration below depicts Unity Catalog features for enhanced search and discovery.

dfrozo_1-1729759075114.png
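To make the idea concrete, keyword search over asset metadata can be sketched in a few lines of plain Python. This is a deliberately simplified model, not Catalog Explorer's implementation or Unity Catalog's actual schema; the asset names and fields below are illustrative only:

```python
# Hypothetical asset metadata records; the real Catalog Explorer indexes
# far richer metadata (lineage, insights, comments) than this toy model.
assets = [
    {"name": "sales.orders", "type": "table", "comment": "Daily order facts"},
    {"name": "sales_dashboard", "type": "dashboard", "comment": "Revenue overview"},
    {"name": "churn_model", "type": "model", "comment": "Customer churn classifier"},
]

def search_assets(assets, keyword):
    """Return assets whose name or comment mentions the keyword."""
    kw = keyword.lower()
    return [a for a in assets
            if kw in a["name"].lower() or kw in a["comment"].lower()]

print([a["name"] for a in search_assets(assets, "sales")])
```

The point is simply that discoverability hinges on searchable, well-described metadata; the richer the comments and tags, the more useful the search.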

Databricks integrates with several enterprise cataloging systems, giving them a way to read metadata about data assets from Unity Catalog and register those assets in their own solutions. A few integration examples:

Data Quality

Data quality refers to the extent to which an organization's information assets meet predefined standards of excellence, which are defined by attributes such as accuracy, completeness, consistency, timeliness, and relevance.

Data quality enables more informed and accurate decision-making by providing reliable information for analysis and strategic planning. High-quality data also enhances operational efficiency by reducing errors and streamlining processes across the organization. Lastly, it improves regulatory compliance and risk management.

Databricks Unity Catalog offers several features and products that contribute to data quality:

  • Native Delta Lake features: ACID transactions ensure data integrity and consistency, time travel enables rollbacks to previous states, and schema enforcement ensures data adheres to predefined structures and definitions.
  • Lakehouse monitoring: Provides out-of-the-box quality metrics for data and AI assets.
  • Delta Live Tables (DLT): Designed to create robust, quality-controlled data pipelines by establishing data constraints and expectations.
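Conceptually, a DLT expectation is a named boolean rule evaluated per record, with failing records flagged or dropped. The sketch below illustrates that idea in plain Python; it does not reproduce the actual DLT API (which declares expectations via decorators on pipeline code), and the rows and rule names are made up:

```python
# Illustrative expectation-style checks; DLT expresses these declaratively
# on pipeline tables, which this pure-Python sketch only approximates.
def apply_expectations(rows, expectations):
    """Split rows into (passed, failed) against all named rules."""
    passed, failed = [], []
    for row in rows:
        ok = all(rule(row) for rule in expectations.values())
        (passed if ok else failed).append(row)
    return passed, failed

rows = [{"id": 1, "amount": 10.0}, {"id": None, "amount": -5.0}]
expectations = {
    "valid_id": lambda r: r["id"] is not None,
    "positive_amount": lambda r: r["amount"] > 0,
}
good, bad = apply_expectations(rows, expectations)
```

In DLT the equivalent rules also surface as quality metrics on the pipeline, so constraint violations become observable rather than silent.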

Databricks integrates with other data governance solutions in the data quality area. Integrations can be carried out in two primary ways: pull-based or push-based. In the pull-based approach, the partner tool pulls metadata, as described in the previous section, and scans data quality. For example, Microsoft Purview enables data quality scans for Databricks Unity Catalog tables, allowing users to assess the quality of data stored in Databricks.

On the other hand, the push-based method allows data quality rules to be pushed down to Unity Catalog. For example, Collibra offers data quality pushdown for Databricks, allowing data quality computations to be performed directly within Databricks, reducing data movement while improving processing efficiency. Similarly, Informatica allows the automated creation of data quality rules that can be applied directly to data in Unity Catalog, enhancing the overall data quality management process.

Auditing Data Entitlements and Access

Auditing data access involves assessing who can access specific data, the conditions for access, and tracking any changes. Auditing helps organizations safeguard their data, ensure compliance, and optimize data management processes. By maintaining a comprehensive audit trail, organizations can monitor access patterns and detect unauthorized access or modifications, thereby enhancing data security. In addition, conducting regular audits helps ensure that data access controls meet regulatory requirements, reducing the risk of non-compliance and associated penalties.

To support customer needs for data auditing, Databricks offers several capabilities. Audit logging provides detailed records of user activities, and customers can integrate this service with their monitoring systems. With Unity Catalog, data access changes across all workspaces attached to a metastore are logged to an audit table in the system catalog.
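As a toy illustration of what teams do with such logs, the sketch below filters a list of audit-style events for denied requests and per-user activity. The field names are made up for the example; the real Unity Catalog audit logs, queryable from the system catalog, carry a much richer schema:

```python
# Hypothetical audit events; real audit logs include service names, request
# parameters, source IPs, and more than this simplified record shape.
events = [
    {"user": "alice@example.com", "action": "getTable",    "status": 200},
    {"user": "bob@example.com",   "action": "deleteTable", "status": 403},
    {"user": "bob@example.com",   "action": "getTable",    "status": 200},
]

def denied_events(events):
    """Return events rejected with an HTTP-style 403 status."""
    return [e for e in events if e["status"] == 403]

def actions_by_user(events, user):
    """List the actions a given user attempted, in order."""
    return [e["action"] for e in events if e["user"] == user]
```

In practice the same questions ("who was denied access, and to what?") are asked with SQL over the audit table rather than in application code.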

Security partners like Immuta can be integrated with Databricks to enhance data auditing and reporting capabilities. Immuta provides real-time insights into data access and usage, helping maintain a transparent audit trail and ensuring compliance with data privacy regulations.

Data Classification

Data classification refers to an organization's capability to categorize data based on multiple criteria, such as sensitivity, level of privacy, or business value. Its role in data governance is crucial as it facilitates the adoption of data privacy and access control mechanisms for the entire data ecosystem within organizations. Data classification improves data governance by aiding risk assessment and mitigation, simplifying compliance, and adding relevant context to data assets aligned with business needs.

Users can implement data classification in Databricks by leveraging the tags functionality in Unity Catalog, as illustrated in the image below. Tags simplify the search and discoverability of securable objects from a workspace, such as catalogs, schemas, tables, table columns, volumes, views, registered models, and model versions.

dfrozo_6-1729759339985.png
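Once assets carry tags, classification-driven questions reduce to filtering on key/value pairs. A minimal sketch of that idea, with invented asset and tag names (in Unity Catalog, tags on securable objects are queried through the catalog's metadata views rather than an in-memory dict):

```python
# Illustrative tagged assets; names and tag keys are made up for the example.
tagged_assets = {
    "main.hr.employees": {"sensitivity": "pii", "owner": "hr"},
    "main.sales.orders": {"sensitivity": "internal", "owner": "sales"},
    "main.hr.salaries":  {"sensitivity": "pii", "owner": "hr"},
}

def assets_with_tag(assets, key, value):
    """Return asset names whose tags contain the given key/value pair."""
    return sorted(name for name, tags in assets.items()
                  if tags.get(key) == value)
```

This kind of lookup ("show me everything tagged pii") is exactly what downstream privacy and access-control policies build on.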

Amongst our ISV solution providers, Privacera's Data Discovery module provides users with discovery and classification capabilities integrated with Unity Catalog. The solution offers automated data classification for both structured and unstructured data, utilizing machine learning models to identify and tag data before enriched metadata is added to Unity Catalog. Users can leverage Databricks Partner Connect to integrate with Privacera.

Immuta, on the other hand, offers a Sensitive Data Classification feature fully compatible with Unity Catalog APIs. It enriches user metadata, creates, classifies, and applies tags to sensitive data, and leverages Unity lineage for tag propagation. We cannot emphasize enough the importance of thorough data discovery and classification, as they form a strong basis for establishing better data privacy and access controls.

Data Privacy and Access Control

After implementing proper mechanisms for data discoverability and classification, data privacy comes into play with the design and enforcement of access control policies. Having security policies in place helps platform administrators and data stewards prevent unauthorized access and misuse of highly sensitive data such as PII (Personally Identifiable Information) or critical business data such as IP (Intellectual Property). As a result, organizations can be more confident in their regulatory and compliance posture and their capacity for handling sensitive data throughout its entire lifecycle, i.e., collection, retention, and disposal.

Databricks offers a robust set of capabilities for data privacy and access control across securable objects in Unity Catalog. Privileges, as shown in the image below, in combination with securable objects such as schemas, tables, views, functions, and volumes, allow admins to implement access control across the entire Databricks Data Intelligence Platform. In addition, Unity Catalog allows administrators to implement fine-grained access controls via dynamic views, row filters, and column masks.

dfrozo_7-1729759395308.png
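A column mask is essentially a function of the cell value and the caller's identity. The sketch below shows the shape of such a rule in plain Python; in Unity Catalog, masks are defined as SQL user-defined functions attached to columns, and the group name here is purely illustrative:

```python
# Toy column-mask rule: privileged groups see the raw value, everyone else
# a redacted form. The group name "pii_readers" is an invented example.
def mask_email(value, user_groups):
    """Return the email unmasked for pii_readers, redacted otherwise."""
    if "pii_readers" in user_groups:
        return value
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain
```

Because the decision is made per query and per caller, the same table can safely serve both privileged and unprivileged readers without maintaining redacted copies.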

Immuta and Privacera are ISV partners that integrate with Unity Catalog to expand data governance capabilities, including data privacy and fine-grained access controls. Privacera Access Management utilizes deployed data classification methods to support TBAC (Tag-Based Access Control), RBAC (Role-Based Access Control), and ABAC (Attribute-Based Access Control) policies applied down to the table, row, and column levels. The Immuta Data Security Platform, in turn, orchestrates dynamic attribute-based access control policies, including dynamic grants and revokes, row-level security, and column masking. These policies and controls are enforced at query runtime, eliminating the need to predefine all data users.
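The core of an attribute-based decision is simple: a request is allowed only when the user's attributes satisfy everything the policy requires. A toy sketch of that decision function, with invented attribute names (real ABAC engines like those above add hierarchies, tag propagation, and runtime context):

```python
# Minimal attribute-based access decision; attribute keys are illustrative.
def allowed(user_attrs, policy_attrs):
    """True when the user holds every attribute the policy requires."""
    return all(user_attrs.get(k) == v for k, v in policy_attrs.items())

# A finance analyst with high clearance requests a high-clearance asset.
decision = allowed({"dept": "finance", "clearance": "high"},
                   {"clearance": "high"})
```

The appeal of ABAC is exactly this decoupling: policies reference attributes and tags rather than individual users, so onboarding a new user never requires rewriting policies.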

Data Observability

Data observability is another critical aspect of supporting effective data governance. It refers to the ability to monitor the health, quality, and behavior of data across all of an organization's systems and pipelines. Organizations that implement data observability continuously monitor data quality metrics, measure these indicators against company-defined policies, and quickly identify and resolve data issues to enhance data integrity and reliability.

At the Data and AI Summit 2023, Databricks launched Lakehouse Monitoring powered by Unity Catalog, which offers end-to-end Lakehouse observability capabilities. In the context of AI, Databricks Lakehouse Monitoring allows MLOps teams to benefit from auto-generated monitoring metrics to measure performance and gain visibility across ML pipelines and AI assets such as ML models and model serving endpoints. This unified approach simplifies the detection and evaluation of errors, allowing users to quickly perform root cause analysis and find solutions. The image below exemplifies how Lakehouse Monitoring can monitor table metrics to easily profile, diagnose, and enforce quality directly in the Databricks Data Intelligence Platform. The sample demo for Lakehouse Monitoring can be accessed via the Databricks demos page.

dfrozo_8-1729759433760.png
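To ground the idea, two of the simplest profile metrics a monitor tracks per column are the null rate and the mean. The hand-rolled sketch below is a toy illustration of the kind of statistics Lakehouse Monitoring auto-generates, not the product's API; the column values are invented:

```python
# Hand-computed profile metrics on a toy column; monitoring products compute
# these (and drift against baselines) automatically over real tables.
def null_rate(values):
    """Fraction of missing entries in a column."""
    return sum(v is None for v in values) / len(values)

def column_mean(values):
    """Mean over the non-null entries."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

amounts = [10.0, None, 30.0, 20.0]
```

A monitor turns such numbers into time series, so a sudden jump in null rate or a drifting mean surfaces as an alert rather than a silent downstream failure.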

Monte Carlo integrates with Unity Catalog to provide a single pane of glass for data observability, including data quality monitoring, alerting, and root cause analysis functionalities. Their approach focuses on detecting data quality issues by monitoring failures or issues at the infrastructure level. Leveraging integrations with solutions like Fivetran or dbt helps engineers identify potential blind spots on downstream pipelines, directly impacting the quality and reliability of data within the Databricks Lakehouse. 

Conclusion

With Unity Catalog, Databricks provides the foundation for all data governance dimensions of the Data Intelligence Platform. Organizations can adopt extended functionality for data privacy and access control, data classification, data discoverability, lineage, and auditing using the ISV technology providers explored in this blog, all of which integrate with Unity Catalog. By providing domain-specific integration examples, we have illustrated how to achieve enterprise data governance with Databricks.