Technical Blog
dkushari
Databricks Employee

Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, with open connectivity to any data source and any format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.

With its open APIs and the introduction of credential vending, data governed by Databricks Unity Catalog can be read by external engines and interfaces such as the Iceberg REST APIs, DuckDB, Apache Spark™, and Trino.

In this blog, we explore how you can use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD (Create, Read, Update, and Delete) operations on your tables registered in a Databricks Unity Catalog, using UC’s open source REST APIs and Iceberg REST Catalog (IRC) APIs.

You can now use Spark SQL and DataFrame APIs to operate on Databricks Unity Catalog tables from an external processing engine, without having to configure your entire Spark application with one set of credentials to allow access to all your tables. Instead, the Spark integration will automatically acquire per-table credentials from UC (assuming the user has the necessary permissions) when running your Spark jobs. 

If you’d like to learn how you can set up your own Unity Catalog server and use Apache Spark™ from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Unity Catalog OSS metastore using UC’s open source REST APIs, please refer to this blog.

Securing Access Requests from External Engines

When Apache Spark requests access to data in a table registered in a Databricks UC metastore from an external processing engine, Unity Catalog issues short-lived credentials and URLs to control storage access based on the user’s specific IAM roles or managed identities, enabling data retrieval and query execution. The detailed steps are captured in the diagram below.

[Diagram: credential vending flow when an external engine requests access to a UC table]
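The first step in this flow, authenticating the external client, is a standard OAuth 2.0 client-credentials exchange against the workspace token endpoint. As a minimal sketch (the endpoint path and `all-apis` scope match the shell configuration later in this post; the workspace URL and secret values below are illustrative placeholders, not values from this walkthrough):

```python
from urllib.parse import urlencode

def build_token_request(workspace_url: str, client_id: str, client_secret: str):
    """Assemble the URL and form body for an OAuth 2.0 client-credentials
    token request against the workspace's /oidc/v1/token endpoint."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "all-apis",  # scope used by the IRC configuration in this post
    })
    return f"{workspace_url}/oidc/v1/token", body

# Illustrative values only; substitute your own workspace URL and OAuth secret.
url, body = build_token_request(
    "https://example.cloud.databricks.com", "my-client-id", "my-client-secret")
```

The Spark integration performs this exchange for you; the sketch only shows the shape of the request so the later `auth.oauth.*` settings are easier to read.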

Experiencing Apache Spark in Action with Unity Catalog’s Open APIs

In this section, we’ll look at how you can perform CRUD operations on tables registered in Databricks Unity Catalog using Spark SQL and PySpark DataFrame APIs. We’ll walk through the following steps:

  1. Setting up Apache Spark on the local workstation
  2. Accessing Databricks UC from the local terminal
  3. Performing CRUD operations on Delta tables in Databricks UC from the local terminal
  4. Performing a UC access control test
  5. Performing CRUD operations on Managed Iceberg tables in Databricks UC from the local terminal
  6. Accessing UniForm Iceberg tables in Databricks UC from the local terminal

Step 1: Setting up Apache Spark on the local workstation

The first step is to download and configure Apache Spark. Download Spark 4.0.1 or later (this blog uses spark-4.0.1) using a command like the following:

curl -O https://archive.apache.org/dist/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz

Next, untar the package using the following command (for the rest of this tutorial, I’ll assume you’re using spark-4.0.1):

tar xzf spark-4.0.1-bin-hadoop3.tgz

Step 2: Accessing Databricks UC from the local terminal

You can access Databricks UC from Apache Spark via the terminal using the Spark SQL shell or the PySpark shell.

Accessing Databricks UC from the Spark SQL shell

To use the Spark SQL shell (bin/spark-sql), go into the bin folder inside the downloaded Apache Spark folder (spark-4.0.1-bin-hadoop3) in your terminal:

cd spark-4.0.1-bin-hadoop3/bin

Once you’re inside the bin folder, run the following command to launch the spark-sql shell (see below for a discussion of the packages and configuration options): 

./spark-sql --name "UC-OpenAPI-OAuth" \
  --master "local[*]" \
  --packages "io.delta:delta-spark_2.13:4.0.1,io.unitycatalog:unitycatalog-spark_2.13:0.3.1,org.apache.hadoop:hadoop-aws:3.4.0" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
  --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.type=oauth" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.uri=https://<<Your Databricks Workspace URL>>/oidc/v1/token" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.clientId=<<ClientID from Your OAuth secret>>" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.clientSecret=<<ClientSecret from Your OAuth secret>>" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.renewCredential.enabled=true" \
  --conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Note the following items in this command:

  • --packages points to the delta-spark and unitycatalog-spark packages.
  • spark.sql.defaultCatalog=<<Your Default UC Catalog>> must be filled out to indicate the default catalog you want to use when launching the spark-sql shell.
  • spark.sql.catalog.<<Your Default UC Catalog>>.uri points to the Databricks UC REST API endpoint for the workspace.
  • spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.clientId and spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.clientSecret are your Client ID and Client Secret for M2M OAuth authorization. They are used to authenticate the principal as a legitimate user to the Databricks platform. Access to the UC objects is controlled via the UC permissions model.
  • spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem must be set to access your cloud object storage (in this example, AWS S3).
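If you prefer to build a SparkSession programmatically rather than pass flags on the command line, the same options can be assembled as a plain configuration dictionary. This is only a sketch: the catalog name and workspace URL in the example call are placeholders, not values from this post.

```python
def uc_catalog_conf(catalog: str, workspace_url: str,
                    client_id: str, client_secret: str) -> dict:
    """Mirror the spark-sql launch flags above as a dict of Spark conf keys."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "io.unitycatalog.spark.UCSingleCatalog",
        "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        prefix: "io.unitycatalog.spark.UCSingleCatalog",
        f"{prefix}.uri": workspace_url,
        f"{prefix}.auth.type": "oauth",
        f"{prefix}.auth.oauth.uri": f"{workspace_url}/oidc/v1/token",
        f"{prefix}.auth.oauth.clientId": client_id,
        f"{prefix}.auth.oauth.clientSecret": client_secret,
        f"{prefix}.renewCredential.enabled": "true",
        "spark.sql.defaultCatalog": catalog,
    }

# Placeholder catalog name and workspace URL for illustration only:
conf = uc_catalog_conf("main", "https://example.cloud.databricks.com",
                       "my-client-id", "my-client-secret")
```

Each key/value pair can then be applied with `SparkSession.builder.config(k, v)` before calling `getOrCreate()`.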

Now you’re ready to perform operations using Spark SQL in Databricks UC.

Accessing Databricks UC from the PySpark shell

To use the PySpark shell (bin/pyspark), go into the bin folder inside your downloaded Apache Spark folder (spark-4.0.1-bin-hadoop3) in your terminal:

cd spark-4.0.1-bin-hadoop3/bin

Once you’re inside the bin folder, run the following command to launch the pyspark shell (see the previous section for a discussion of the packages and configuration options):

./pyspark --name "UC-OpenAPI-OAuth" \
  --master "local[*]" \
  --packages "io.delta:delta-spark_2.13:4.0.1,io.unitycatalog:unitycatalog-spark_2.13:0.3.1,org.apache.hadoop:hadoop-aws:3.4.0" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
  --conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.type=oauth" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.uri=https://<<Your Databricks Workspace URL>>/oidc/v1/token" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.clientId=<<ClientID from Your OAuth secret>>" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.auth.oauth.clientSecret=<<ClientSecret from Your OAuth secret>>" \
  --conf "spark.sql.catalog.<<Your Default UC Catalog>>.renewCredential.enabled=true" \
  --conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Now you’re ready to perform operations using PySpark in Databricks UC.

Step 3: Performing CRUD operations on Delta tables in Databricks UC from the local terminal

In this step, I’ll walk you through performing some CRUD operations on Databricks UC tables. I’ll use Spark SQL here, but the same SQL commands can be run in the PySpark shell by embedding them in spark.sql(). You can also use PySpark DataFrame APIs to perform DML (Data Manipulation Language) operations on the Databricks UC tables.
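As a hedged sketch of that PySpark embedding, the statements can simply be passed to `spark.sql()` using the session object the pyspark shell predefines. `run_statements` is a hypothetical convenience helper, not part of any API, and the statements listed are a small subset of this step:

```python
def run_statements(spark, statements):
    """Run each SQL statement through an existing SparkSession and
    return the resulting DataFrames (hypothetical helper)."""
    return [spark.sql(stmt) for stmt in statements]

# A few of this step's statements, ready for run_statements(spark, ...)
# inside the pyspark shell:
crud_statements = [
    "USE SCHEMA db_schema",
    "INSERT INTO db_uc_table VALUES (1, 'a'), (2, 'b')",
    "SELECT * FROM db_uc_table ORDER BY id",
]
```

In the pyspark shell you would call `run_statements(spark, crud_statements)` after the schema and table exist.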

Here are some of the commands you can run inside the spark-sql shell, including example output for some of them:

  • See all the accessible UC catalogs:
SHOW CATALOGS;
  • See all the available schemas in the default (<<Your Default UC Catalog>>) catalog you used when launching the spark-sql shell: 
SHOW SCHEMAS;

[Screenshot: SHOW CATALOGS and SHOW SCHEMAS output in the spark-sql shell]

  • Create a new schema:
CREATE SCHEMA db_schema;
  • Use this new schema as the default schema: 
USE SCHEMA db_schema;
  • Create a new external table:
CREATE EXTERNAL TABLE
db_uc_table (id INT, desc STRING)
USING DELTA
LOCATION 's3://<<Your AWS S3 Location>>';
  • Describe the newly created table:
DESC EXTENDED db_uc_table;

[Screenshot: DESC EXTENDED db_uc_table output in the spark-sql shell]

For comparison, here’s how this information looks in the Databricks workspace:

[Screenshots: db_uc_table details viewed in the Databricks workspace]

  • Insert some records from the local terminal into the newly created table and select from that table:
INSERT INTO db_uc_table VALUES (1,'a'),(2,'b'), (3,'c'),(4,'d'),(5,'e');
SELECT * FROM db_uc_table ORDER BY id;

 

[Screenshot: INSERT and SELECT output in the spark-sql shell]

Here’s what you’ll see if you explore the same table from the Databricks workspace:

[Screenshots: the same rows viewed from the Databricks workspace]

Next, let’s create a new managed Delta table in the same UC catalog and schema and read it from both the local terminal and the Databricks workspace. Here’s what you’ll see in the spark-sql shell before the managed table is created:

[Screenshot: SHOW TABLES output in the spark-sql shell before the managed table exists]

Note: At the time of writing, the ability to create and write data to managed tables from external clients such as Apache Spark is available only in Private Preview. Please contact your Databricks account team to request access.

Now, create the managed Delta table from the local terminal. Then run the show tables command again from the local terminal. Now the managed table is in the list:

CREATE TABLE db_uc_table_managed (id INT, desc STRING) USING delta TBLPROPERTIES ('delta.feature.catalogManaged'='supported');

[Screenshots: SHOW TABLES output now including db_uc_table_managed]

You can insert data into the managed table from both the local terminal and the Databricks workspace. First, insert and select some data from the local terminal:

INSERT INTO db_uc_table_managed VALUES (1,'a');
SELECT * FROM db_uc_table_managed;

[Screenshot: INSERT and SELECT output for db_uc_table_managed in the spark-sql shell]

You can select the same data from the Databricks workspace:

[Screenshot: the same row selected from the Databricks workspace]

And here we are inserting data into the managed table from the Databricks workspace:

[Screenshot: inserting a row into db_uc_table_managed from the Databricks workspace]

Here’s another example of selecting data from the managed table from both the local terminal and the Databricks workspace:

[Screenshots: selecting from db_uc_table_managed in the local terminal and the Databricks workspace]

Now let’s try some further DML operations on the external table from the local terminal and the Databricks workspace. First, let's delete some rows from the external table from the local terminal:

DELETE FROM db_uc_table WHERE id IN (1, 3);

Selecting data from the external table in both the local terminal and the Databricks workspace shows that two rows have been deleted:

[Screenshots: db_uc_table after the delete, from the local terminal and the Databricks workspace]

Next, let's perform an update activity on the managed table from the Databricks workspace:

[Screenshot: running the update from the Databricks workspace]

UPDATE db_uc_table_managed SET desc = 'f' WHERE id = 3;

Then select the data from the managed table using the local terminal. Notice the change reflected for the record with id = 3:

[Screenshots: db_uc_table_managed after the update, read from the local terminal]

We can show the history of changes made to the managed table by DML operations, either from the Databricks workspace UI or by running the following command in the spark-sql shell:

[Screenshot: table history in the Databricks workspace UI]

DESC HISTORY db_uc_table_managed;

Step 4: Performing a UC access control test

Next, let’s run a quick test to verify that access control works as expected. We’ll change the UC permissions of the Databricks-authenticated user who is accessing the spark-sql shell from the local terminal, removing their access to the external Delta table (db_uc_table) we’ve been working with.

Let's change the owner of the external Delta table to another user. As a result, the user whose credentials the spark-sql shell in the local terminal uses to access UC data no longer has SELECT permission on the table.

[Screenshots: changing the table owner in the Databricks workspace]

This query from the local terminal fails with permission denied, as expected:

[Screenshot: the query failing with a permission-denied error in the spark-sql shell]

This shows that the user who was authenticated via Databricks M2M OAuth also requires proper UC authorization to access data governed by Unity Catalog. Without the proper UC permission, access to the UC object will be denied.

Step 5: Performing CRUD operations on Managed Iceberg tables in Databricks UC from the local terminal

You can perform CRUD operations on Managed Iceberg tables from Databricks Unity Catalog through the Iceberg REST Catalog (IRC) API. This allows you to access these tables from any client that supports Iceberg REST Catalog APIs without introducing new dependencies.
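Because UC’s IRC endpoint follows the standard Iceberg REST specification, clients address namespaces and tables with spec-defined paths under the workspace’s IRC root. The small helper below only assembles such paths for illustration; it assumes the UC catalog name serves as the REST path prefix (consistent with the warehouse setting used later in this step), and the workspace URL and names in the example call are placeholders.

```python
def irc_path(workspace_url: str, catalog: str, *segments: str) -> str:
    """Build an Iceberg REST Catalog URL rooted at UC's IRC endpoint,
    using the UC catalog name as the REST path prefix (assumption)."""
    base = f"{workspace_url}/api/2.1/unity-catalog/iceberg-rest/v1/{catalog}"
    return "/".join((base,) + segments) if segments else base

# e.g. the spec's "list tables in a namespace" route for schema db_schema:
tables_url = irc_path("https://example.cloud.databricks.com", "main",
                      "namespaces", "db_schema", "tables")
```

Any IRC-capable client resolves these routes for you; the helper just makes the URL structure visible.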

To access Managed Iceberg tables in Databricks UC from the local terminal, enter the following command to launch the spark-sql shell:

./spark-sql --name "uc-iceberg-oauth" \
    --master "local[*]" \
    --packages "org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.1,org.apache.iceberg:iceberg-aws-bundle:1.10.1,io.unitycatalog:unitycatalog-spark_2.13:0.3.1,io.delta:delta-spark_2.13:4.0.1" \
    --conf "spark.hadoop.fs.s3.impl=org.apache.iceberg.aws.s3.S3FileIO" \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.rest.RESTCatalog" \
    --conf "spark.sql.catalog.iceberg.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog/iceberg-rest" \
    --conf "spark.sql.catalog.iceberg.oauth2-server-uri=https://<<Your Databricks Workspace URL>>/oidc/v1/token" \
    --conf "spark.sql.catalog.iceberg.scope=all-apis" \
    --conf "spark.sql.catalog.iceberg.credential=<<ClientID from Your OAuth secret>>:<<ClientSecret from Your OAuth secret>>" \
    --conf "spark.sql.defaultCatalog=iceberg" \
    --conf "spark.sql.catalog.iceberg.warehouse=<<Your Default UC Catalog>>"
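As with the Delta configuration earlier, the same launch flags can be expressed as a configuration dictionary for a programmatic SparkSession. This is a sketch only; the catalog name, workspace URL, and credentials in the example call are placeholders.

```python
def irc_catalog_conf(name: str, workspace_url: str, warehouse: str,
                     client_id: str, client_secret: str) -> dict:
    """Mirror the spark-sql IRC launch flags above as a dict of Spark conf keys."""
    p = f"spark.sql.catalog.{name}"
    return {
        "spark.hadoop.fs.s3.impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        p: "org.apache.iceberg.spark.SparkCatalog",
        f"{p}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{p}.uri": f"{workspace_url}/api/2.1/unity-catalog/iceberg-rest",
        f"{p}.oauth2-server-uri": f"{workspace_url}/oidc/v1/token",
        f"{p}.scope": "all-apis",
        f"{p}.credential": f"{client_id}:{client_secret}",
        f"{p}.warehouse": warehouse,
        "spark.sql.defaultCatalog": name,
    }

# Illustrative placeholders only:
conf = irc_catalog_conf("iceberg", "https://example.cloud.databricks.com",
                        "main", "my-client-id", "my-client-secret")
```

Note that the `warehouse` value is the UC catalog name, while `name` is only the local Spark catalog alias.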

Let’s create a Managed Iceberg table from the local terminal and view its details.

USE db_schema;
SHOW TABLES;
CREATE TABLE db_uc_table_managed_ib (id INT, desc STRING) USING ICEBERG;
SHOW TABLES;
DESC EXTENDED db_uc_table_managed_ib;

[Screenshot: creating and describing db_uc_table_managed_ib in the spark-sql shell]

Now, let's insert some rows into the Managed Iceberg table from the local terminal:

INSERT INTO db_uc_table_managed_ib VALUES (1, 'a'), (2,'b'), (3,'c'),(4,'d'),(5,'e');

Here’s what we’ll see in the local terminal when we select rows from the Managed Iceberg table:

SELECT * FROM db_uc_table_managed_ib ORDER BY id;

[Screenshot: SELECT output for db_uc_table_managed_ib in the spark-sql shell]

We can also read the same Managed Iceberg table from the Databricks workspace:

[Screenshot: the same Managed Iceberg table read from the Databricks workspace]

Let’s update the Managed Iceberg table and read from it again. Here’s the query we ran from the local terminal (as usual, you could also run it from the Databricks workspace):

UPDATE db_uc_table_managed_ib SET desc = 'updated' where id = 5;

And here’s what we see when we read from the table:

[Screenshots: db_uc_table_managed_ib after the update, from the local terminal and the Databricks workspace]

Step 6: Accessing UniForm Iceberg tables in Databricks UC from the local terminal

Using the Iceberg REST Catalog (IRC) API from the local terminal, we can also read UniForm Iceberg tables created from the Databricks workspace. Here is an example:

CREATE TABLE iceberg_ext_tab (id INT, desc STRING)
LOCATION 's3://<<Your AWS S3 Location>>'
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg');

INSERT INTO iceberg_ext_tab VALUES (1, 'a'), (2,'b'), (3,'c'),(4,'d'),(5,'e');

DESC EXTENDED iceberg_ext_tab;

SELECT * FROM iceberg_ext_tab ORDER BY id;

[Screenshots: DESC EXTENDED and SELECT output for iceberg_ext_tab]

Conclusion

This blog showed you how to use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta and Managed Iceberg tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs and Iceberg REST Catalog (IRC) APIs. We also looked at how you can read your UniForm Iceberg tables registered in Databricks UC using the Iceberg REST Catalog (IRC) API. Try out Unity Catalog’s Open APIs and Iceberg REST Catalog (IRC) APIs today to access and process your data securely from any external engine using Apache Spark.
