Enterprises often have federated teams with a central governance structure. Depending on the sensitivity of the data, the governance team controls who can access it. In this article we show how to implement governance and access control on Databricks at the table level when dealing with personally identifiable information (PII) and non-PII data.
For the purposes of this technical blog, we’re considering a simple scenario where we have just 2 tables: one table contains sales information, while the other contains PII. The former can be shared across the company, while for the latter the centralized governance team wants to restrict the access at table level.
In our scenario, User A owns two datasets, “sales” and “PII”, and shares a workspace with User B. At the end of the exercise, User B should be able to query data from “sales” but not from “PII”, in line with the principle of least privilege. Finer-grained access controls (e.g., row-level security) are outside the scope of this blog post.
The access control mechanisms differ depending on the technology stack, so we first introduce the basics of Microsoft Fabric and the Databricks Data Intelligence Platform and their terminology.
Architecture
Setup of environment
Access pattern 1 - SQL Endpoint
Access pattern 2 - PySpark notebook access
Access pattern 3 - Lakehouse explorer (Browsing a preview of 5 rows)
Summary
(Source: https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview)
Fabric is part of the Microsoft 365 ecosystem and its default storage, called OneLake, is an abstraction on top of ADLSv2 with Delta as the default storage format. OneLake has a hierarchical structure: Fabric supports a single tenant and uses the construct of “Workspaces” to separate business units (e.g., marketing, sales, manufacturing) and environments (dev, tst, stg, prd). A workspace consists of one or more lakehouses, which are collections of files, folders, and tables. A SQL analytics endpoint can be added to a lakehouse to provide data warehousing capabilities.
Figure 2
Depending on the role of the actor accessing the system, data can be accessed via Spark notebooks or SQL endpoints, among others. We will consider three access patterns: the SQL analytics endpoint, a notebook (PySpark or SparkSQL), and the Lakehouse Explorer (browsing a preview of the data).
Each user needs to have a role in a workspace, with the choice of Admin, Member, Contributor, and Viewer. The differences are explained here: https://learn.microsoft.com/en-us/fabric/get-started/roles-workspaces.
In this blog, we want to apply the principle of least privilege and provide access to the “sales” table, but not to the “PII” table. The administrator of “Workspace 1” sets this up as follows.
User A has “Admin” role for “Workspace 1”. The first step is to create the two tables with the statements below:
CREATE TABLE Sales (
sale_id INT,
product_name VARCHAR(50),
quantity INT,
price DECIMAL(10, 2),
customer_id INT
);
INSERT INTO Sales (sale_id, product_name, quantity, price, customer_id) VALUES
(1, 'Laptop', 2, 999.99, 1),
(2, 'Smartphone', 3, 299.99, 2),
(3, 'Headphones', 5, 79.99, 3),
(4, 'Tablet', 1, 499.99, 4),
(5, 'Smartwatch', 2, 199.99, 5);
CREATE TABLE PII (
id INT,
name VARCHAR(50),
email VARCHAR(100),
phone_number VARCHAR(15),
address VARCHAR(100)
);
INSERT INTO PII (id, name, email, phone_number, address) VALUES
(1, 'John Doe', 'johndoe@example.com', '123-456-7890', '123 Elm Street'),
(2, 'Jane Smith', 'janesmith@example.com', '234-567-8901', '456 Oak Street'),
(3, 'Alice Johnson', 'alicej@example.com', '345-678-9012', '789 Pine Street'),
(4, 'Bob Brown', 'bobbrown@example.com', '456-789-0123', '101 Maple Street'),
(5, 'Carol White', 'carolwhite@example.com', '567-890-1234', '202 Birch Street');
First, User A needs to create a “Lakehouse” as a container for the tables. They then open a “SQL analytics endpoint”, but they cannot run the CREATE or INSERT statements there, since the SQL analytics endpoint only allows querying data, not running DML or DDL statements. User A therefore opens a notebook and runs the statements in SQL cells. The data is stored in the default OneLake storage and the tables are visible in the lakehouse via the “Explorer” interface, where users can also preview the table content (limited to 5 rows).
User A can successfully query the two tables (“sales” and “PII”) both via the “SQL analytics endpoint” and via the notebook (either via SparkSQL or PySpark).
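As a quick sanity check, User A can run something like the following in a notebook SQL cell; the lakehouse name “lakehouse” is assumed here, matching the later examples:

%%sql
-- Spark SQL cell in the Fabric notebook: confirm the table is queryable
SELECT * FROM lakehouse.sales LIMIT 5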
|        | Lakehouse Explorer | SQL analytics endpoint | Notebook |
|--------|--------------------|------------------------|----------|
| CREATE | N                  | N                      | Y        |
| SELECT | Y (subset)         | Y                      | Y        |

Table 1
Now User A invites User B by adding them to the workspace as “Contributor”. User B goes to the “Explorer” of the lakehouse and can browse all the tabular data (both the “sales” and “PII” tables). Via right-click > “View files”, User B can also see the underlying files, and as a Contributor they can even rename the tables.
At first glance, it is clear that Fabric has a “default allow” approach, where any “deny” must be specified explicitly. Hence, User A has to issue a DENY to prevent User B from viewing the PII data.
User A continues in the notebook where they created the tables and issues:
DENY SELECT ON dbo.pii TO [userb@databricks.com];
However, DCL statements are not allowed in notebooks, so User A has to run the DENY from an SQL analytics endpoint.
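For completeness, a sketch of the statements User A runs from the SQL analytics endpoint is shown below; the explicit GRANT on “sales” is not strictly required given Fabric’s default-allow behavior and is included only for illustration:

-- Run from the SQL analytics endpoint (T-SQL); DCL is not supported in notebooks
DENY SELECT ON dbo.pii TO [userb@databricks.com];
-- Illustrative only: an explicit grant on the table that may be shared
GRANT SELECT ON dbo.sales TO [userb@databricks.com];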
Now User B tries to access the “PII” table in various ways:
When trying to SELECT the PII data from the SQL analytics endpoint, User B gets access denied as expected, with the following error message (amended):
Started running query at line 1
The SELECT permission or external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Rows/Select' was denied on the object 'pii', database 'lakehouse', schema 'dbo'.
Msg 373, Level 14, State 5, Code line 1
Then, from a notebook, User B can issue PySpark and SparkSQL statements and still access both the “sales” and “pii” tables.
%%pyspark
df = spark.sql("SELECT * FROM lakehouse.pii LIMIT 1000")
display(df)
%%sql
SELECT * FROM lakehouse.pii LIMIT 1000
Finally, user B can still access the data and get a preview from the lakehouse “Explorer”, notwithstanding the DENY.
|       | SQL analytics endpoint | Notebook | Lakehouse Explorer |
|-------|------------------------|----------|--------------------|
| sales | Y                      | Y        | Y (subset)         |
| PII   | N                      | Y        | Y (subset)         |

Table 2
Conclusion: although User B has no GRANT on the table, they can still access the data by browsing the Explorer and by using a notebook, either with PySpark or SQL.
Additional information: both the “sales” and “pii” tables are located in OneLake and managed by Fabric. We also created “Shortcuts” to external tables (e.g., tables in ADLSv2 managed by Databricks) and observed the same results described in Table 2: the DENY only affects the SQL analytics endpoint, and User B is able to access the external data via the “Shortcuts”.
Pros:
Cons:
Databricks, specifically its Azure implementation, is a data lakehouse solution based on ADLSv2, with Delta as the default storage format. Unity Catalog provides the data governance layer for users and an abstraction layer over storage, for both structured and unstructured data. The catalog provides a “Catalog Explorer”; besides that, users access data via notebooks or via SQL warehouses, with access control enforced in all cases by Unity Catalog.
Figure 3
Each user needs to have access to the workspace, with fine-grained permissions on compute resources and data. The approach with Unity Catalog is a default “deny all”.
User A has “Admin” role in their “Workspace”. The first step is to create the two tables with the statements below:
CREATE TABLE Sales (
sale_id INT,
product_name VARCHAR(50),
quantity INT,
price DECIMAL(10, 2),
customer_id INT
);
INSERT INTO Sales (sale_id, product_name, quantity, price, customer_id) VALUES
(1, 'Laptop', 2, 999.99, 1),
(2, 'Smartphone', 3, 299.99, 2),
(3, 'Headphones', 5, 79.99, 3),
(4, 'Tablet', 1, 499.99, 4),
(5, 'Smartwatch', 2, 199.99, 5);
CREATE TABLE PII (
id INT,
name VARCHAR(50),
email VARCHAR(100),
phone_number VARCHAR(15),
address VARCHAR(100)
);
INSERT INTO PII (id, name, email, phone_number, address) VALUES
(1, 'John Doe', 'johndoe@example.com', '123-456-7890', '123 Elm Street'),
(2, 'Jane Smith', 'janesmith@example.com', '234-567-8901', '456 Oak Street'),
(3, 'Alice Johnson', 'alicej@example.com', '345-678-9012', '789 Pine Street'),
(4, 'Bob Brown', 'bobbrown@example.com', '456-789-0123', '101 Maple Street'),
(5, 'Carol White', 'carolwhite@example.com', '567-890-1234', '202 Birch Street');
User A has the choice of executing the statements either with a Notebook or a SQL warehouse.
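A minimal sketch of the setup is shown below; the catalog name “catalog-edp” is taken from later in this post, while the schema name “sales_data” is assumed for illustration:

-- Run in a notebook or in the SQL editor (on a SQL warehouse)
USE CATALOG `catalog-edp`;
CREATE SCHEMA IF NOT EXISTS sales_data;
USE SCHEMA sales_data;
-- ...then run the CREATE TABLE / INSERT statements shown above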
The data is stored in the default ADLSv2 storage and the tables are visible in the Catalog via the “Explorer” interface, where users can also preview the table content. The “Catalog Explorer” can then be used to manage access control to tables.
User A can successfully query the two tables (“sales” and “pii”) both via the SQL editor (on a SQL warehouse) and via the notebook (either via SparkSQL or PySpark).
|        | Catalog Explorer      | SQL editor | Notebook |
|--------|-----------------------|------------|----------|
| CREATE | Y (catalogs, schemas) | Y          | Y        |
| SELECT | Y (paginated)         | Y          | Y        |

Table 3
Now User A invites User B by adding them to the workspace without the “Admin access” entitlement. User B goes to the “Catalog Explorer” but is unable to access the tables or to navigate the schema containing them. This is in line with the default “deny all” approach.
User A grants User B access to the catalog “catalog-edp” (via USE CATALOG). This allows User B to see the catalog’s schemas, but nothing else. Then User A grants User B access to the schema (via USE SCHEMA). Now User B can see the schema’s tables, but still cannot access the data. Finally, User A grants access to the table(s) (via GRANT SELECT).
User A can perform all these grant operations either via the “Catalog Explorer” UI or via SQL statements in a notebook or the SQL editor.
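In SQL, the grant chain looks roughly like the following; the schema name “sales_data” is assumed for illustration, as before:

-- Step 1: let User B see the catalog and its schemas
GRANT USE CATALOG ON CATALOG `catalog-edp` TO `userb@databricks.com`;
-- Step 2: let User B see the tables in the schema
GRANT USE SCHEMA ON SCHEMA `catalog-edp`.sales_data TO `userb@databricks.com`;
-- Step 3: let User B read the sales table only (no grant is issued on pii)
GRANT SELECT ON TABLE `catalog-edp`.sales_data.sales TO `userb@databricks.com`;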
The following table summarizes how User B can access tables (or not).
|       | SQL editor | Notebook | Catalog Explorer |
|-------|------------|----------|------------------|
| sales | Y          | Y        | Y (paginated)    |
| pii   | N          | N        | N                |

Table 4
Conclusion: User B can access only the tables for which they have been explicitly GRANTed access. All access methods (SQL editor, notebook, Catalog Explorer) consistently enforce the access control governed by Unity Catalog.
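One way to double-check the outcome is to inspect the grants on each table (schema name assumed as before):

-- User B should appear in the grants on sales but not on pii
SHOW GRANTS ON TABLE `catalog-edp`.sales_data.sales;
SHOW GRANTS ON TABLE `catalog-edp`.sales_data.pii;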
Pros:
Cons:
In this blog we explored coarse-grained access control, such as table access control, as a first step. Both implementations described in this blog are based on features available today and do not take future roadmaps into account.
Today, Microsoft Fabric offers no mechanism to centrally implement access control across both the lakehouse (via PySpark) and the data warehouse (via the SQL endpoint). This creates multiple authorization silos and increases the risk of unauthorized access to datasets.
The lack of centralized governance and access control in Fabric across its various compute engines (KQL, T-SQL, Synapse, Spark) makes it unsuitable for implementing this use case today.
Microsoft promises to unify data governance within Fabric in the future through “One Security”, although no public documentation is available today that shows how this will work, and at the time of writing One Security is not mentioned in the public roadmap. It also seems that Purview will play a large role in data governance within Fabric, coupling Fabric governance to Purview.
Today, Azure Databricks supports unified table-level access control across SQL warehouses, the Catalog Explorer, job and interactive clusters, and serverless compute. This offers a unified governance layer that simplifies the data access control strategy and reduces the risk of unauthorized data access.
A couple of topics are out of scope for this blog and could be addressed in future posts.