Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, with open connectivity to any data source and any format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.
With its open APIs and the introduction of credential vending, data managed by Databricks Unity Catalog can be read by external engines and interfaces such as Apache Spark™, Trino, DuckDB, and any client that supports the Iceberg REST APIs.
In this blog, we explore how you can use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD (Create, Read, Update, and Delete) operations on your tables registered in a Databricks Unity Catalog, using UC’s open source REST APIs and Iceberg Rest Catalog (IRC) APIs.
You can now use Spark SQL and DataFrame APIs to operate on Databricks Unity Catalog tables from an external processing engine, without having to configure your entire Spark application with one set of credentials to allow access to all your tables. Instead, the Spark integration will automatically acquire per-table credentials from UC (assuming the user has the necessary permissions) when running your Spark jobs.
If you’d like to learn how you can set up your own Unity Catalog server and use Apache Spark™ from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Unity Catalog OSS metastore using UC’s open source REST APIs, please refer to this blog.
When Apache Spark requests access to data in a table registered in a Databricks UC metastore from an external processing engine, Unity Catalog issues short-lived credentials and URLs to control storage access based on the user’s specific IAM roles or managed identities, enabling data retrieval and query execution. The detailed steps are captured in the diagram below.
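From the client's side, the credential handshake can be sketched roughly as follows. The endpoint path matches Unity Catalog's credential-vending REST API; the table id and helper function below are hypothetical placeholders, not part of any SDK:

```python
# Sketch of the request body a client sends to UC's credential-vending
# endpoint (POST /api/2.1/unity-catalog/temporary-table-credentials).
# The table id below is a hypothetical placeholder.
CREDENTIALS_PATH = "/api/2.1/unity-catalog/temporary-table-credentials"

def build_credential_request(table_id: str, operation: str) -> dict:
    """Build the JSON body for a per-table temporary credential request."""
    if operation not in ("READ", "READ_WRITE"):
        raise ValueError("operation must be READ or READ_WRITE")
    return {"table_id": table_id, "operation": operation}

# UC responds with short-lived, table-scoped storage credentials and URLs,
# which Spark then uses to read or write the underlying data files.
request_body = build_credential_request("aaaabbbb-1111-2222-3333-ccccddddeeee", "READ")
```

The key point is that credentials are requested per table and per operation, so no long-lived, broadly scoped storage credential ever has to be wired into the Spark application itself.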
In this section, we’ll look at how you can perform CRUD operations on tables registered in Databricks Unity Catalog using Spark SQL and PySpark DataFrame APIs. We’ll walk through the following steps:
The first step is to download and configure Apache Spark (version >= 4.0.1; this blog uses spark-4.0.1).
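For example, assuming you're pulling spark-4.0.1 from the Apache archive (any official mirror works just as well):

```shell
# Download the Spark 4.0.1 binary package (Hadoop 3 build) from the
# Apache archive; substitute another mirror or version if you prefer.
wget https://archive.apache.org/dist/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
```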
Next, untar the downloaded package (for the rest of this tutorial, I'll assume you're using spark-4.0.1).
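Assuming the tarball name from the previous step:

```shell
# Extract the tarball; this creates the spark-4.0.1-bin-hadoop3 folder.
tar -xzf spark-4.0.1-bin-hadoop3.tgz
```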
You can access Databricks UC from Apache Spark via the terminal using the Spark SQL shell or the PySpark shell.
To use the Spark SQL shell (bin/spark-sql), go into the bin folder inside the downloaded Apache Spark folder in your terminal with cd spark-4.0.1-bin-hadoop3/bin.
Once you're inside the bin folder, launch the spark-sql shell with the Unity Catalog connector packages and the configuration options that point it at your Databricks workspace.
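A launch command along these lines works; the package versions and the exact catalog URI form are assumptions to verify against the Unity Catalog Spark connector documentation, and the values in angle brackets are placeholders for your workspace, catalog, and token:

```shell
# Launch spark-sql with the Delta and Unity Catalog connector packages.
# <uc-catalog>, <workspace-url> and <token> are placeholders; the package
# versions shown are assumptions -- check for the latest releases.
./spark-sql \
  --name "databricks-uc-test" \
  --master "local[*]" \
  --packages "io.delta:delta-spark_2.13:4.0.0,io.unitycatalog:unitycatalog-spark_2.13:0.3.0" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.<uc-catalog>=io.unitycatalog.spark.UCSingleCatalog" \
  --conf "spark.sql.catalog.<uc-catalog>.uri=https://<workspace-url>/api/2.1/unity-catalog" \
  --conf "spark.sql.catalog.<uc-catalog>.token=<token>" \
  --conf "spark.sql.defaultCatalog=<uc-catalog>"
```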
Note that the command supplies the connector packages Spark needs along with per-catalog settings: the catalog implementation class, the workspace URI, and an authentication token.
Now you’re ready to perform operations using Spark SQL in Databricks UC.
To use the PySpark shell (bin/pyspark), go into the bin folder inside your downloaded Apache Spark folder in your terminal with cd spark-4.0.1-bin-hadoop3/bin.
Once you're inside the bin folder, launch the pyspark shell with the same packages and configuration options used for the spark-sql shell in the previous section.
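For example (as before, package versions and the catalog URI form are assumptions, and the angle-bracket values are placeholders):

```shell
# Launch pyspark with the same connector packages and per-catalog settings
# as the spark-sql shell. <uc-catalog>, <workspace-url> and <token> are
# placeholders; package versions are assumptions -- check for the latest.
./pyspark \
  --name "databricks-uc-test" \
  --master "local[*]" \
  --packages "io.delta:delta-spark_2.13:4.0.0,io.unitycatalog:unitycatalog-spark_2.13:0.3.0" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.<uc-catalog>=io.unitycatalog.spark.UCSingleCatalog" \
  --conf "spark.sql.catalog.<uc-catalog>.uri=https://<workspace-url>/api/2.1/unity-catalog" \
  --conf "spark.sql.catalog.<uc-catalog>.token=<token>" \
  --conf "spark.sql.defaultCatalog=<uc-catalog>"
```

Inside the pyspark shell, you can then run SQL by wrapping it in spark.sql("...").show().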
Now you’re ready to perform operations using PySpark in Databricks UC.
In this step, I'll walk you through some CRUD operations on Databricks UC tables. I'll use Spark SQL here, but you can run the same SQL commands in the PySpark shell by embedding them in spark.sql(). You can also use the PySpark DataFrame APIs to perform DML (Data Manipulation Language) operations on Databricks UC tables.
Here are some of the commands you can run inside the spark-sql shell.
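A representative sequence looks like the following; the schema and table names match the ones used later in this blog, while the storage LOCATION is a placeholder you'd replace with your own external location:

```sql
-- Explore the catalog and create an external Delta table
SHOW SCHEMAS;
USE db_schema;
SHOW TABLES;

CREATE TABLE db_uc_table (id INT, desc STRING)
USING delta LOCATION 's3://<your-bucket>/<path>/db_uc_table';

-- Insert some rows and read them back
INSERT INTO db_uc_table VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
SELECT * FROM db_uc_table ORDER BY id;
```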
For comparison, you can view the same information in the Databricks workspace's Catalog Explorer.
You'll see the same details if you explore the table from the Databricks workspace.
Next, let's create a new managed table in the same UC catalog and schema and read it from both the local terminal and the Databricks workspace. Before the managed table is created, a show tables command in the spark-sql shell lists only the external table.
Note: At the time of writing, the ability to create and write data to managed tables from external clients such as Apache Spark is available only in Private Preview. Please contact your Databricks account team to request access.
Now, create the managed Delta table from the local terminal. Then run the show tables command again from the local terminal; this time the managed table appears in the list.
CREATE TABLE db_uc_table_managed (id INT, desc STRING) USING delta TBLPROPERTIES ('delta.feature.catalogManaged'='supported');
You can insert data into the managed table from both the local terminal and the Databricks workspace. First, insert and select some data from the local terminal.
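For example (the specific rows here are illustrative):

```sql
INSERT INTO db_uc_table_managed VALUES (1, 'a'), (2, 'b'), (3, 'c');
SELECT * FROM db_uc_table_managed ORDER BY id;
```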
You can select the same data from the Databricks workspace, and you can likewise insert data into the managed table from the workspace; selecting from the managed table afterwards returns the same rows in both the local terminal and the Databricks workspace.
Now let's try performing some upsert activities on the external table from the local terminal and the Databricks workspace. First, let's delete some rows from the external table from the local terminal.
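For instance, a delete along these lines removes two rows (the predicate here is a hypothetical one chosen to match the two-row deletion described below):

```sql
-- Hypothetical predicate: remove the two highest ids
DELETE FROM db_uc_table WHERE id IN (4, 5);
```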
Selecting from the external table in both the local terminal and the Databricks workspace now shows that 2 rows have been deleted.
Next, let's perform an update activity on the managed table from the Databricks workspace:
UPDATE db_uc_table_managed SET desc = 'f' WHERE id = 3;
Then select the data from the managed table using the local terminal; you'll see the change reflected for the record with id = 3.
We can show the history of DML changes to the managed table with the DESCRIBE HISTORY command, or from the Databricks workspace UI:
DESC HISTORY db_uc_table_managed;
Next, let's run a quick test to verify that access control works as expected. We'll change the UC permissions for the Databricks-authenticated user who is running the spark-sql shell on the local terminal, removing their access to the external Delta table (db_uc_table) we've been working with.
Let's change the owner of the external Delta table to another user, so that the user whose credentials the local spark-sql shell uses no longer has SELECT permission on the table.
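The ownership change can be made with a statement like the following; the principal shown is a hypothetical placeholder for the new owner:

```sql
-- The principal below is a hypothetical placeholder for another user
ALTER TABLE db_uc_table OWNER TO `another.user@example.com`;
```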
A SELECT query on the table from the local terminal now fails with permission denied, as expected.
This shows that the user who was authenticated via Databricks M2M OAuth also requires proper UC authorization to access data governed by Unity Catalog. Without the proper UC permission, access to the UC object will be denied.
You can perform CRUD operations on Managed Iceberg tables from Databricks Unity Catalog through the Iceberg REST Catalog (IRC) API. This allows you to access these tables from any client that supports Iceberg REST Catalog APIs without introducing new dependencies.
To access Managed Iceberg tables in Databricks UC from the local terminal, launch the spark-sql shell with the Iceberg packages and REST catalog configuration.
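A launch command along these lines works; the iceberg-spark-runtime version and the exact IRC endpoint path are assumptions to verify against the Databricks and Apache Iceberg documentation, and the angle-bracket values are placeholders:

```shell
# Launch spark-sql against UC's Iceberg REST Catalog (IRC) endpoint.
# <uc-catalog>, <workspace-url> and <token> are placeholders; the runtime
# version and endpoint path are assumptions -- check the current docs.
./spark-sql \
  --name "databricks-uc-iceberg" \
  --master "local[*]" \
  --packages "org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0" \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.<uc-catalog>=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.<uc-catalog>.type=rest" \
  --conf "spark.sql.catalog.<uc-catalog>.uri=https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest" \
  --conf "spark.sql.catalog.<uc-catalog>.token=<token>" \
  --conf "spark.sql.catalog.<uc-catalog>.warehouse=<uc-catalog>" \
  --conf "spark.sql.defaultCatalog=<uc-catalog>"
```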
Let’s create a Managed Iceberg table from the local terminal and view its details.
USE db_schema;
SHOW TABLES;
CREATE TABLE db_uc_table_managed_ib (id INT, desc STRING) USING ICEBERG;
SHOW TABLES;
DESC EXTENDED db_uc_table_managed_ib;
Now, let's insert some rows into the Managed Iceberg table from the local terminal:
INSERT INTO db_uc_table_managed_ib VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
Here’s what we’ll see in the local terminal when we select rows from the Managed Iceberg table:
SELECT * FROM db_uc_table_managed_ib ORDER BY id;
We can also read the same Managed Iceberg table from the Databricks workspace.
Let's update the Managed Iceberg table and read from it again. Here's the update query, which you can run from either the local terminal or the Databricks workspace:
UPDATE db_uc_table_managed_ib SET desc = 'updated' WHERE id = 5;
Reading from the table again shows the updated value for the row with id = 5.
Using the Iceberg REST Catalog (IRC) API from the local terminal, we can also read UniForm Iceberg tables created from the Databricks workspace.
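For example, a read like the following works once the IRC-configured shell is running; the UniForm table name here is a hypothetical placeholder for a table you created in the workspace:

```sql
-- db_uc_table_uniform is a hypothetical UniForm table name
SELECT * FROM db_schema.db_uc_table_uniform ORDER BY id;
```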
This blog showed you how to use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta and Managed Iceberg tables registered in a Databricks Unity Catalog metastore, using UC's open REST APIs and Iceberg REST Catalog (IRC) APIs. We also looked at how you can read UniForm Iceberg tables registered in Databricks UC through the IRC API. Try out Unity Catalog's open APIs and IRC APIs today to access and process your data securely from Apache Spark or any other external engine.