Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, with open connectivity to any data source in any format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.
With its open APIs and the introduction of credential vending, data registered in Databricks Unity Catalog can be read by external engines and interfaces such as the Iceberg REST API, DuckDB, Apache Spark™, and Trino.
In this blog, we explore how you can use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD (Create, Read, Update, and Delete) operations on your tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs.
You can now use Spark SQL and DataFrame APIs to operate on Databricks Unity Catalog tables from an external processing engine, without having to configure your entire Spark application with one set of credentials to allow access to all your tables. Instead, the Spark integration will automatically acquire per-table credentials from UC (assuming the user has the necessary permissions) when running your Spark jobs.
If you’d like to learn how you can set up your own Unity Catalog server and use Apache Spark™ from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Unity Catalog OSS metastore using UC’s open source REST APIs, please refer to this blog.
When Apache Spark requests access to data in a table registered in a Databricks UC metastore from an external processing engine, Unity Catalog issues short-lived credentials and URLs to control storage access based on the user’s specific IAM roles or managed identities, enabling data retrieval and query execution. The detailed steps are captured in the diagram below.
In this section, we’ll look at how you can perform CRUD operations on tables registered in Databricks Unity Catalog using Spark SQL and the PySpark DataFrame API. We’ll walk through the steps below.
The first step is to download and configure Apache Spark. You can download the latest version of Spark (>= 3.5.3) using a command like the following:
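For example, you can pull the release straight from the Apache archive (the exact version and mirror are up to you):

    wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz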
Next, untar the package using the following command (for the rest of this tutorial, I’ll assume you’re using Spark 3.5.3):
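A typical extraction command looks like this:

    tar -xzf spark-3.5.3-bin-hadoop3.tgz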
You can access Databricks UC from Apache Spark via the terminal using the Spark SQL shell or the PySpark shell.
To use the Spark SQL shell (bin/spark-sql), go into the bin folder inside the downloaded Apache Spark folder (spark-3.5.3-bin-hadoop3) in your terminal:
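For example, assuming you extracted Spark into your current working directory:

    cd spark-3.5.3-bin-hadoop3/bin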
Once you’re inside the bin folder, run the following command to launch the spark-sql shell (see below for a discussion of the packages and configuration options):
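The exact command depends on your environment; the sketch below shows the general shape. The package versions are examples, and <catalog-name>, <workspace-url>, and <PAT> are placeholders you’ll need to substitute (you may also need a cloud storage package such as org.apache.hadoop:hadoop-aws for your table locations):

    ./spark-sql --name "local-uc-test" \
        --master "local[*]" \
        --packages "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1" \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        --conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
        --conf "spark.sql.catalog.<catalog-name>=io.unitycatalog.spark.UCSingleCatalog" \
        --conf "spark.sql.catalog.<catalog-name>.uri=https://<workspace-url>/api/2.1/unity-catalog" \
        --conf "spark.sql.catalog.<catalog-name>.token=<PAT>" \
        --conf "spark.sql.defaultCatalog=<catalog-name>"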
Note the following items in this command: the --packages option pulls in Delta Lake and the Unity Catalog Spark connector; spark.sql.extensions enables Delta’s SQL extensions; the spark.sql.catalog.<catalog-name> settings register your UC catalog with Spark and point it at your workspace’s Unity Catalog REST endpoint; the token setting supplies the Databricks personal access token (PAT) used to authenticate; and spark.sql.defaultCatalog makes that catalog the default for the session.
Now you’re ready to perform operations using Spark SQL in Databricks UC.
To use the PySpark shell (bin/pyspark), go into the bin folder inside your downloaded Apache Spark folder (spark-3.5.3-bin-hadoop3) in your terminal:
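As before, assuming Spark was extracted into your current working directory:

    cd spark-3.5.3-bin-hadoop3/bin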
Once you’re inside the bin folder, run the following command to launch the pyspark shell (see the previous section for a discussion of the packages and configuration options):
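The command mirrors the spark-sql launch above; only the shell binary changes (the same placeholder caveats apply):

    ./pyspark --name "local-uc-test" \
        --master "local[*]" \
        --packages "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1" \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        --conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
        --conf "spark.sql.catalog.<catalog-name>=io.unitycatalog.spark.UCSingleCatalog" \
        --conf "spark.sql.catalog.<catalog-name>.uri=https://<workspace-url>/api/2.1/unity-catalog" \
        --conf "spark.sql.catalog.<catalog-name>.token=<PAT>" \
        --conf "spark.sql.defaultCatalog=<catalog-name>"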
Now you’re ready to perform operations using PySpark in Databricks UC.
In this step, I’ll walk you through performing some CRUD operations on Databricks UC tables. I’ll use Spark SQL here, but the same SQL commands can be run in the PySpark shell by embedding them in spark.sql(). You can also use the PySpark DataFrame API to perform DML (Data Manipulation Language) operations on Databricks UC tables, as sketched below.
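As a rough sketch of the DataFrame route, run inside the pyspark shell (the schema, table, and column names are placeholders, not the objects used in this walkthrough):

    # Read a UC-governed Delta table via the DataFrame API
    df = spark.table("<schema-name>.<external-table-name>")
    df.show()

    # Append a row using the DataFrame writer
    new_rows = spark.createDataFrame([(10, "new-record")], ["id", "name"])
    new_rows.write.format("delta").mode("append").saveAsTable("<schema-name>.<external-table-name>")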
Here are some of the commands you can run inside the spark-sql shell:
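The sketch below illustrates the flavor of those commands; the schema, table, column names, and storage location are placeholders to adapt to your own catalog:

    -- Browse what the authenticated user can see
    SHOW SCHEMAS;
    SHOW TABLES IN <schema-name>;

    -- Create an external Delta table (the storage path is a placeholder)
    CREATE TABLE <schema-name>.<external-table-name> (id INT, name STRING)
    USING DELTA
    LOCATION 's3://<bucket>/<path>';

    -- Inspect, write, and read
    DESCRIBE TABLE EXTENDED <schema-name>.<external-table-name>;
    INSERT INTO <schema-name>.<external-table-name> VALUES (1, 'alpha'), (2, 'beta');
    SELECT * FROM <schema-name>.<external-table-name>;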
For comparison, here’s how this information looks in the Databricks workspace:
Here’s what you’ll see if you explore the same table from the Databricks workspace:
Next, let’s create a new managed table in the same UC catalog and schema from the Databricks workspace and read it from the local terminal. Here’s what you’ll see in the spark-sql shell before the managed table is created:
Now, create the managed table from the Databricks workspace:
Then run the show tables command from the local terminal again. Now the managed table is in the list:
You can insert data into the managed table from the local terminal as well as the Databricks workspace. First, insert and select some data from the local terminal:
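For example (placeholder names and values):

    INSERT INTO <schema-name>.<managed-table-name> VALUES (3, 'gamma');
    SELECT * FROM <schema-name>.<managed-table-name> ORDER BY id;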
You can select the same data from the Databricks workspace:
And here we are inserting data into the managed table from the Databricks workspace:
Here’s another example of selecting data from the managed table from both the local terminal and the Databricks workspace:
Now let’s try some update and delete operations from the local terminal and the Databricks workspace. First, let’s delete some records from the external table from the local terminal:
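A sketch of such a delete (placeholder names and predicate):

    DELETE FROM <schema-name>.<external-table-name> WHERE id = 2;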
Here we are selecting data from the external table using the local terminal and the Databricks workspace:
Next, let's perform an update activity on the managed table from the Databricks workspace:
Then select the data from the managed table using the local terminal. Notice that the change is reflected for the record with id = 3:
We can show the history of the changes in the managed table due to DML operations from the local terminal and view it in the Databricks workspace UI as well:
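With Delta tables, this history is available through standard Delta SQL, along these lines (placeholder name):

    DESCRIBE HISTORY <schema-name>.<managed-table-name>;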
Next, let’s do a quick test to verify that access control is working as we expect. To perform this test, we’re going to make a change to the UC permissions for the Databricks authenticated user who is accessing the spark-sql shell from the local terminal. Namely, we’ll remove their access to the UC object (i.e., the Delta table) we’ve been working with.
Let's change the owner of the external Delta table to a service principal instead of the current authenticated named user. This means the named user no longer has SELECT permission on the table.
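In the workspace, this can be done through the Catalog Explorer UI or with SQL roughly like the following (the table and service principal identifiers are placeholders):

    ALTER TABLE <catalog-name>.<schema-name>.<external-table-name> OWNER TO `<service-principal-application-id>`;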
This query from the Databricks workspace will now fail with insufficient permissions, as expected:
The same query from the local terminal fails with insufficient permissions as well:
This shows that the user who was authenticated using the Databricks PAT also requires proper UC authorization to access the data governed by Unity Catalog. Without the proper UC permission, access to the UC object will be denied.
You can query UniForm Iceberg tables from Databricks Unity Catalog through the Iceberg REST API. This allows you to access these tables from any client that supports Iceberg REST APIs without introducing new dependencies.
To enable accessing your UniForm Iceberg tables in Databricks UC from the local terminal, enter the following command to launch the spark-sql shell:
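A sketch, identical to the earlier spark-sql launch except for the extra Iceberg runtime package (the same placeholder caveats apply):

    ./spark-sql --name "local-uc-test" \
        --master "local[*]" \
        --packages "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1" \
        --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
        --conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
        --conf "spark.sql.catalog.<catalog-name>=io.unitycatalog.spark.UCSingleCatalog" \
        --conf "spark.sql.catalog.<catalog-name>.uri=https://<workspace-url>/api/2.1/unity-catalog" \
        --conf "spark.sql.catalog.<catalog-name>.token=<PAT>" \
        --conf "spark.sql.defaultCatalog=<catalog-name>"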
Notice the inclusion of an additional package, org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1, for Iceberg. This is the only change from the previous configuration shown for launching the spark-sql shell.
Let’s run some queries against a UniForm Iceberg table registered in Databricks UC. First, we'll create a UniForm Iceberg table. We can do this from the local terminal or the Databricks workspace:
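A minimal sketch of such a table definition (placeholder names; the two table properties are the documented switches for enabling UniForm Iceberg on a Delta table):

    CREATE TABLE <schema-name>.<uniform-table-name> (id INT, name STRING)
    USING DELTA
    TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    );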
We can now execute the following commands in the local terminal:
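For instance, you might confirm that the new table is visible and inspect its metadata (placeholder names; the exact commands in your session may differ):

    SHOW TABLES IN <schema-name>;
    DESCRIBE TABLE EXTENDED <schema-name>.<uniform-table-name>;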
Here’s what we’ll see:
Next, let’s insert some data into the UniForm Iceberg table, via the Databricks workspace:
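For example (placeholder names and values):

    INSERT INTO <schema-name>.<uniform-table-name> VALUES (1, 'alpha'), (2, 'beta');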
We can read from this table in the local terminal or the Databricks workspace using commands like the following:
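For example, using the placeholder name from above:

    SELECT * FROM <schema-name>.<uniform-table-name>;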
Finally, let’s update the UniForm Iceberg table and read from it again. Here’s the query we ran to update the table (as usual, you can enter this in the local terminal or the Databricks workspace, as shown here):
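A representative update (placeholder names and predicate):

    UPDATE <schema-name>.<uniform-table-name> SET name = 'beta-updated' WHERE id = 2;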
And here’s what we see when we read from the table:
This blog showed you how to use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs. We also looked at how you can read your UniForm Iceberg tables registered in Databricks UC using the Iceberg REST API. Try out Unity Catalog’s open APIs today to access and process your data securely from any external engine using Apache Spark!