Enterprises often have federated teams with a central governance structure. Depending on the sensitivity of the data, the governance team controls who can have access to it. In this article we show how to implement governance and access control at the table level when dealing with personally identifiable information (PII) and non-PII data, comparing how this works on Microsoft Fabric and on the Databricks Data Intelligence Platform.

Use case

For the purposes of this technical blog, we’re considering a simple scenario with just two tables: one table contains sales information, while the other contains PII. The former can be shared across the company, while for the latter the centralized governance team wants to restrict access at the table level.

In our scenario, user A owns two datasets, “sales” and “PII”, and needs to collaborate with user B. At the end of the exercise, user B should be able to query data from “sales” but not from “PII”, in line with least-privilege principles. Finer-grained access controls (e.g. row-level security) are outside the scope of this blog post.

Figure 1

 

The access control mechanisms can differ depending on the technology stack utilized, hence we will introduce the basics of Microsoft Fabric and Databricks Data Intelligence Platform and their terminologies.

Implementation on Microsoft Fabric

Architecture

(Image source: https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview)

Fabric’s lake is called OneLake, an abstraction on top of ADLSv2, with Delta as the default storage format. OneLake has a hierarchical structure: Fabric supports a single tenant, which can contain multiple “Workspaces” used to separate business units (e.g. marketing, sales, manufacturing, …) and environments (dev, tst, stg, prd). A workspace consists of one or more lakehouses, each a collection of files, folders, and tables, and a SQL analytics endpoint can be added to a lakehouse to provide data warehousing capabilities. Fabric itself is delivered as a SaaS offering within the Microsoft 365 ecosystem.

 


 

Figure 2



Depending on the role of the actor accessing the system, you can use Spark notebooks or SQL endpoints, among others. We will consider three ways:

  • Browsing the lakehouse
  • Accessing data via PySpark code from a notebook
  • Accessing data via SQL from a SQL analytics endpoint

 

Each user needs to have a role in a workspace, with the choice of Admin, Member, Contributor, and Viewer. The differences are explained here: https://learn.microsoft.com/en-us/fabric/get-started/roles-workspaces.

 

Setup of environment

In this blog, we want to apply the principle of least privilege and provide access to the “sales” table, but not to the “PII” table. The administrator of “Workspace 1” will:

  • Create table “sales” and “PII” with DDL and DML statements
  • Invite another user to “Workspace 1” and share “sales” data only

 

User A has the “Admin” role for “Workspace 1”. The first step is to create the two tables with the statements below:

 

CREATE TABLE Sales (
    sale_id INT,
    product_name VARCHAR(50),
    quantity INT,
    price DECIMAL(10, 2),
    customer_id INT
);

INSERT INTO Sales (sale_id, product_name, quantity, price, customer_id) VALUES
(1, 'Laptop', 2, 999.99, 1),
(2, 'Smartphone', 3, 299.99, 2),
(3, 'Headphones', 5, 79.99, 3),
(4, 'Tablet', 1, 499.99, 4),
(5, 'Smartwatch', 2, 199.99, 5);

CREATE TABLE PII (
    id INT,
    name VARCHAR(50),
    email VARCHAR(100),
    phone_number VARCHAR(15),
    address VARCHAR(100)
);

INSERT INTO PII (id, name, email, phone_number, address) VALUES
(1, 'John Doe', 'johndoe@example.com', '123-456-7890', '123 Elm Street'),
(2, 'Jane Smith', 'janesmith@example.com', '234-567-8901', '456 Oak Street'),
(3, 'Alice Johnson', 'alicej@example.com', '345-678-9012', '789 Pine Street'),
(4, 'Bob Brown', 'bobbrown@example.com', '456-789-0123', '101 Maple Street'),
(5, 'Carol White', 'carolwhite@example.com', '567-890-1234', '202 Birch Street');

 

First, User A needs to create a “Lakehouse” as a container for the tables. They then open a “SQL analytics endpoint”, but they’re not able to perform CREATE or INSERT, since the SQL endpoint only allows for querying data, not performing DML or DDL statements. User A then opens a notebook and runs the statements in SQL cells. The data is stored in the default OneLake storage and the tables are visible in the lakehouse via the “Explorer” interface, where users can also preview the table content (with a limit of 5 rows).

 

User A can successfully query the two tables (“sales” and “PII”) both via the “SQL analytics endpoint” and via the notebook (either via SparkSQL or PySpark).
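For example, a quick check from the SQL analytics endpoint could look like the query below. This is only a sketch: the tables land in the default dbo schema of the lakehouse, which is also the schema referenced by the DENY statement later in this post.

-- Sketch: a query User A can run from the SQL analytics endpoint;
-- the tables live in the default dbo schema of the lakehouse.
SELECT product_name, quantity, price
FROM dbo.Sales;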

 

 

Operation | Lakehouse Explorer | SQL analytics endpoint | Notebook
CREATE    | N                  | N                      | Y
SELECT    | Y (subset)         | Y                      | Y

Table 1



Now User A invites User B by adding them to the workspace as “Contributor”. User B goes to the “Explorer” of the lakehouse and can browse all the tabular data (both the “sales” and “PII” tables). Via right-click, User B can open “View files” and see the underlying data files. As a Contributor, User B can also rename the tables.

 

At first glance, it’s clear that Fabric has a “default allow” approach, where any “deny” must be specified explicitly. Hence, User A has to DENY access to prevent User B from viewing PII data.

 

User A continues in the notebook where they created the tables and issues:

 

DENY SELECT ON dbo.pii TO [userb@databricks.com];

 

However, DCL statements are not allowed in notebooks, so User A has to run the DENY from an SQL analytics endpoint.
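As a side note, the SQL analytics endpoint accepts the standard T-SQL DCL statements (GRANT, DENY, REVOKE), so the deny above could later be lifted again. A minimal sketch:

-- Sketch: removing the previously issued DENY from the SQL analytics endpoint.
REVOKE SELECT ON dbo.pii FROM [userb@databricks.com];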

 

Now User B tries to access the “PII” table in various ways: 

 

  1. SQL analytics endpoint
  2. Notebook
  3. Via the Lakehouse Explorer feature

 

Access pattern 1 - SQL Endpoint

When trying to SELECT the PII data from the SQL analytics endpoint, User B gets “access denied” as expected, with the following (abridged) error message:

 

Started running query at line 1

The SELECT permission or external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Rows/Select' was denied on the object 'pii', database 'lakehouse', schema 'dbo'.

Msg 373, Level 14, State 5, Code line 1

 

Access pattern 2 - PySpark notebook access

From a notebook, however, User B can issue PySpark and Spark SQL queries and still access both the “sales” and “pii” tables.

 

%%pyspark
df = spark.sql("SELECT * FROM lakehouse.pii LIMIT 1000")
display(df)

%%sql
SELECT * FROM lakehouse.pii LIMIT 1000

 

Access pattern 3 - Lakehouse explorer (Browsing a preview of 5 rows)

Finally, User B can still access the data and get a preview from the lakehouse “Explorer”, notwithstanding the DENY.

 

 

Summary

Table | SQL analytics endpoint | Notebook | Lakehouse Explorer
sales | Y                      | Y        | Y (subset)
PII   | N                      | Y        | Y (subset)

Table 2

 

Conclusion: although User B has no GRANT on the table, they can still access the data by browsing the Lakehouse Explorer and from a notebook, with either PySpark or SQL.

 

Additional information: both “sales” and “pii” tables are located in OneLake and managed by Fabric. We also created “Shortcuts” to external tables, e.g. tables in ADLSv2, managed by Databricks: we experienced the same results described in Table 2, meaning that the DENY only affects the SQL analytics endpoint and User B is able to access external data via “Shortcuts”.

 

Pros:

  • By exposing the tabular data via the SQL analytics endpoint, you can implement granular data access controls with standard SQL (DENY) constructs.
  • The UI experience is nicely integrated, although this does not correspond to the underlying architectural silos.

 

Cons:

  • When sharing the workspace with another user, you give that user full access to all underlying tables in the workspace, even if that user has been denied access to that data via the SQL analytics endpoint.
  • Access is effectively “allow all” when the data is read via PySpark in a notebook or previewed via the Lakehouse Explorer.
  • The most limited role in a workspace is “Viewer”; even so, PySpark access to the data via a notebook is granted, and the preview is still able to show all tables, including the PII table.
  • The SQL endpoint is read-only.

Implementation on Databricks

Architecture

Databricks, specifically its Azure implementation, is a data lakehouse solution based on ADLSv2, with Delta as the default storage format. Unity Catalog provides a data governance layer for users and an abstraction layer over storage, both structured and unstructured. The catalog provides an “Explorer”; besides that, users access data via notebooks or via SQL warehouses, with access control enforced in all cases by Unity Catalog.


 

Figure 3

 

Each user needs to have access to the workspace, with fine-grained permissions on compute resources and data. The approach with Unity Catalog is a default “deny all”.

Setup of environment

User A has the “Admin” role in their “Workspace”. The first step is to create the two tables with the statements below:

 

CREATE TABLE Sales (
    sale_id INT,
    product_name VARCHAR(50),
    quantity INT,
    price DECIMAL(10, 2),
    customer_id INT
);

INSERT INTO Sales (sale_id, product_name, quantity, price, customer_id) VALUES
(1, 'Laptop', 2, 999.99, 1),
(2, 'Smartphone', 3, 299.99, 2),
(3, 'Headphones', 5, 79.99, 3),
(4, 'Tablet', 1, 499.99, 4),
(5, 'Smartwatch', 2, 199.99, 5);

CREATE TABLE PII (
    id INT,
    name VARCHAR(50),
    email VARCHAR(100),
    phone_number VARCHAR(15),
    address VARCHAR(100)
);

INSERT INTO PII (id, name, email, phone_number, address) VALUES
(1, 'John Doe', 'johndoe@example.com', '123-456-7890', '123 Elm Street'),
(2, 'Jane Smith', 'janesmith@example.com', '234-567-8901', '456 Oak Street'),
(3, 'Alice Johnson', 'alicej@example.com', '345-678-9012', '789 Pine Street'),
(4, 'Bob Brown', 'bobbrown@example.com', '456-789-0123', '101 Maple Street'),
(5, 'Carol White', 'carolwhite@example.com', '567-890-1234', '202 Birch Street');

 

User A has the choice of executing the statements either with a Notebook or a SQL warehouse.

 

The data is stored in the default ADLSv2 storage and the tables are visible in the Catalog via the “Explorer” interface, where users can also preview the table content. The “Catalog Explorer” can then be used to manage access control to tables.

 

User A can successfully query the two tables (“sales” and “pii”) both via the SQL editor (on a SQL warehouse) and via a notebook (either with Spark SQL or PySpark).

 

 

Operation | Catalog Explorer      | SQL editor | Notebook
CREATE    | Y (catalogs, schemas) | Y          | Y
SELECT    | Y (paginated)         | Y          | Y

Table 3

 

Now User A invites User B by adding them to the workspace without the “Admin access” entitlement. User B goes to the “Catalog Explorer” but is unable to access the tables or to navigate the schema containing them. This is in line with the default “deny all” approach.

 

User A grants User B access on the catalog “catalog-edp” (via USE CATALOG). This allows User B to see the catalog’s schemas, but nothing else. Then User A grants User B access to the schema, via USE SCHEMA. Now User B can see the schema’s tables, but cannot access the data. Finally, User A grants access to the table(s) with GRANT SELECT, as sketched below.
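In SQL, the sequence of grants could look like the sketch below. The catalog name and user come from this example, while the schema name (here “default”) is an assumption for illustration only.

-- Sketch of the grant sequence described above (Unity Catalog SQL).
-- The schema name "default" is assumed for illustration.
GRANT USE CATALOG ON CATALOG `catalog-edp` TO `userb@databricks.com`;
GRANT USE SCHEMA ON SCHEMA `catalog-edp`.default TO `userb@databricks.com`;
GRANT SELECT ON TABLE `catalog-edp`.default.sales TO `userb@databricks.com`;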

 

User A can perform all these grant operations either via the “Catalog Explorer” UI or via SQL statements in a notebook or the SQL editor.

 

The following table summarizes how User B can access tables (or not).

 

 

Table | SQL editor | Notebook | Catalog Explorer
sales | Y          | Y        | Y (paginated)
pii   | N          | N        | N

Table 4

 

Conclusion: User B can access only the tables for which they’ve been explicitly GRANTed access. All access methods (SQL editor, notebook, Catalog Explorer) consistently enforce the access control governed by Unity Catalog.
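As an optional double check, the table owner can list what User B has actually been granted; a sketch reusing the hypothetical names from the grant example above:

-- Sketch: inspect User B's grants on the shared table.
SHOW GRANTS `userb@databricks.com` ON TABLE `catalog-edp`.default.sales;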

 

Pros:

  • Unity Catalog provides a means to expose tabular data with granular access control, consistently respected by the UI and the compute engines.
  • All resources deployed are native Azure resources with infrastructure-as-code support, such as Azure Databricks workspaces, VNets, ADLSv2 storage accounts, resource groups, and UC catalogs.

Cons:

  • Requires some central governance in the enterprise to set up a metastore per Azure region and enable Unity Catalog per workspace.

 

Conclusion

 

In this blog we explored coarse-grained access control, i.e. table-level access control, as a first step. Both implementations in this blog are based on features available today and do not take future roadmaps into account.

 

Today, Microsoft Fabric offers no mechanism to centrally implement access control across both the lakehouse (via PySpark) and the data warehouse (via SQL endpoint). This creates multiple authorization silos and increases the risk of unauthorized access to datasets.

 

The lack of centralized governance and access control across Fabric’s various compute engines (KQL, T-SQL, Synapse, Spark) makes it unsuitable for implementing this use case today.

 

Microsoft promises to unify this data governance within Fabric in the future through “One Security”, although no public documentation is available today that shows how this will work. At the time of writing, One Security is also not mentioned in the public roadmap. It also seems that Purview will play a large role in data governance within Fabric, coupling Fabric’s governance to Purview.

 

Today, Azure Databricks supports unified table-level access control across SQL warehouses, the Catalog Explorer, job and interactive clusters, and serverless compute. This offers a unified governance layer, simplifying the data access control strategy and reducing the risk of unauthorized data access.

 

A couple of topics not in scope for this blog but that could be addressed in future blogs are:

Contributors