Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, with open connectivity to any data source in any format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.

With its open APIs and the introduction of credential vending, data in Databricks Unity Catalog can be read by external engines and interfaces such as the Iceberg REST APIs, DuckDB, Apache Spark™, and Trino.

In this blog, we explore how you can use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD (Create, Read, Update, and Delete) operations on your tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs. 

You can now use Spark SQL and DataFrame APIs to operate on Databricks Unity Catalog tables from an external processing engine, without having to configure your entire Spark application with one set of credentials to allow access to all your tables. Instead, the Spark integration will automatically acquire per-table credentials from UC (assuming the user has the necessary permissions) when running your Spark jobs. 

If you’d like to learn how you can set up your own Unity Catalog server and use Apache Spark™ from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Unity Catalog OSS metastore using UC’s open source REST APIs, please refer to this blog.

Securing Access Requests from External Engines

When Apache Spark requests access to data in a table registered in a Databricks UC metastore from an external processing engine, Unity Catalog issues short-lived credentials and URLs to control storage access based on the user’s specific IAM roles or managed identities, enabling data retrieval and query execution. The detailed steps are captured in the diagram below.

dkushari_0-1730724640060.png
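
Under the hood, the Spark connector performs this credential exchange for you through Unity Catalog’s REST API. The sketch below (not part of the original walkthrough) illustrates the flow with the Python requests library; the endpoint paths and response fields reflect the Databricks Unity Catalog REST API for temporary table credentials, and the host, token, catalog, schema, and table names are placeholders, so treat the details as assumptions to verify against the API reference.

import requests

host = "https://<<Your Databricks Workspace URL>>"
headers = {"Authorization": "Bearer <<Your Databricks Workspace PAT>>"}

# 1. Resolve the three-level table name to its Unity Catalog table_id.
table = requests.get(
    f"{host}/api/2.1/unity-catalog/tables/<<Your Default UC Catalog>>.<<schema>>.<<table>>",
    headers=headers,
).json()

# 2. Ask Unity Catalog for short-lived, table-scoped storage credentials.
creds = requests.post(
    f"{host}/api/2.1/unity-catalog/temporary-table-credentials",
    headers=headers,
    json={"table_id": table["table_id"], "operation": "READ"},
).json()

# The response carries down-scoped cloud credentials (for AWS, under
# aws_temp_credentials) plus an expiration time; Spark uses these to read the
# table's data files directly from object storage.
print(creds.get("expiration_time"))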

Experiencing Apache Spark in Action with Unity Catalog’s Open APIs

In this section, we’ll look at how you can perform CRUD operations on tables registered in Databricks Unity Catalog using Spark SQL and PySpark DataFrame APIs. We’ll walk through the following steps:

  1. Setting up Apache Spark on the local workstation
  2. Accessing Databricks UC from the local terminal
  3. Performing CRUD operations on Delta tables whose data is stored in cloud storage
  4. Performing a UC access control test
  5. Accessing UniForm Iceberg tables in Databricks UC from the local terminal
  6. Performing read operations on Iceberg tables whose data is stored in cloud storage

Step 1: Setting up Apache Spark on the local workstation

The first step is to download and configure Apache Spark. You can download Spark version 3.5.3 or later using a command like the following:

curl -O https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

Next, untar the package using the following command (for the rest of this tutorial, I’ll assume you’re using Spark 3.5.3):

tar xzf spark-3.5.3-bin-hadoop3.tgz

Step 2: Accessing Databricks UC from the local terminal

You can access Databricks UC from Apache Spark via the terminal using the Spark SQL shell or the PySpark shell.

Accessing Databricks UC from the Spark SQL shell

To use the Spark SQL shell (bin/spark-sql), go into the bin folder inside the downloaded Apache Spark folder (spark-3.5.3-bin-hadoop3) in your terminal:

cd spark-3.5.3-bin-hadoop3/bin

Once you’re inside the bin folder, run the following command to launch the spark-sql shell (see below for a discussion of the packages and configuration options): 

./spark-sql --name "s3-uc-test" \
	--master "local[*]" \
	--repositories "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/" \
	--packages "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0" \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.token=<<Your Databricks Workspace PAT>>" \
	--conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Note the following items in this command:

  • --packages pulls in the hadoop-aws and hadoop-common packages (for S3 access) along with the delta-spark and unitycatalog-spark packages.
  • spark.sql.defaultCatalog=<<Your Default UC Catalog>> must be filled out to indicate the default catalog you want to use when launching the spark-sql shell.
  • spark.sql.catalog.<<Your Default UC Catalog>>.uri points to the Databricks UC REST API endpoint for the workspace.
  • spark.sql.catalog.<<Your Default UC Catalog>>.token is your Databricks workspace personal access token (PAT), which is used to authenticate you as a legitimate user to the Databricks platform. Access to the UC objects is controlled via the UC permissions model.
  • spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem must be set to access your cloud object storage (in this example, AWS S3).

Now you’re ready to perform operations using Spark SQL in Databricks UC.

Accessing Databricks UC from the PySpark shell

To use the PySpark shell (bin/pyspark), go into the bin folder inside your downloaded Apache Spark folder (spark-3.5.3-bin-hadoop3) in your terminal:

cd spark-3.5.3-bin-hadoop3/bin

Once you’re inside the bin folder, run the following command to launch the pyspark shell (see the previous section for a discussion of the packages and configuration options):

./pyspark --name "s3-uc-test" \
	--master "local[*]" \
	--repositories "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/" \
	--packages "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0" \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.token=<<Your Databricks Workspace PAT>>" \
	--conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Now you’re ready to perform operations using PySpark in Databricks UC.
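
If you prefer a standalone PySpark script over the interactive shell, you can build the equivalent session programmatically. The following is a minimal sketch that simply mirrors the shell command above; the same placeholder values apply.

from pyspark.sql import SparkSession

catalog = "<<Your Default UC Catalog>>"

spark = (
    SparkSession.builder
    .appName("s3-uc-test")
    .master("local[*]")
    .config("spark.jars.repositories",
            "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "org.apache.hadoop:hadoop-common:3.3.4,"
            "io.delta:delta-spark_2.12:3.2.1,"
            "io.unitycatalog:unitycatalog-spark_2.12:0.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config(f"spark.sql.catalog.{catalog}", "io.unitycatalog.spark.UCSingleCatalog")
    .config(f"spark.sql.catalog.{catalog}.uri",
            "https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog")
    .config(f"spark.sql.catalog.{catalog}.token", "<<Your Databricks Workspace PAT>>")
    .config("spark.sql.defaultCatalog", catalog)
    .getOrCreate()
)

# Quick sanity check: list the catalogs the authenticated user can see.
spark.sql("SHOW CATALOGS").show()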

Step 3: Performing CRUD operations on Delta tables whose data is stored in cloud storage

In this step, I’ll walk you through performing some CRUD operations on Databricks UC tables. I’ll use Spark SQL here, but the same SQL statements can be run in the PySpark shell by embedding them in spark.sql(), as in the short sketch below. You can also use the PySpark DataFrame APIs to perform DML (Data Manipulation Language) operations on Databricks UC tables (see the DataFrame sketch at the end of this step).
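
For example, from the pyspark shell the same statements look like this:

spark.sql("SHOW CATALOGS").show()
spark.sql("SHOW SCHEMAS").show()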

Here are some of the commands you can run inside the spark-sql shell, including example output for some of them:

  • See all the accessible UC catalogs:
SHOW CATALOGS;
  • See all the available schemas in the default (<<Your Default UC Catalog>>) catalog you used when launching the spark-sql shell: 
SHOW SCHEMAS;

dkushari_1-1730724640079.png

  • Create a new schema:
create schema db_schema;
  • Use this new schema as the default schema: 
use schema db_schema;
  • Create a new external table:
CREATE EXTERNAL TABLE db_uc_table (id INT, desc STRING)
USING DELTA
LOCATION 's3://<<Your AWS S3 Location>>';
  • Describe the newly created table:
desc extended db_uc_table;

dkushari_2-1730724640083.png

For comparison, here’s how this information looks in the Databricks workspace:

dkushari_3-1730724640061.png

  • Insert some records from the local terminal into the newly created table and select from that table:
insert into db_uc_table values (1,'a'),(2,'b'), (3,'c'),(4,'d'),(5,'e');
select * from db_uc_table order by id;


dkushari_4-1730724640055.png

Here’s what you’ll see if you explore the same table from the Databricks workspace:

dkushari_5-1730724640032.png

dkushari_6-1730724640082.png

dkushari_7-1730724640032.png
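
As mentioned earlier, the PySpark DataFrame API can be used for the same kind of operations. Here’s a minimal sketch from the pyspark shell that appends two rows to the external table and reads it back; it assumes the session from Step 2, and the append-via-saveAsTable path is an assumption about what the UC Spark connector supports, so verify it against your connector version.

# Append to the existing external Delta table registered in UC.
new_rows = spark.createDataFrame([(6, "f"), (7, "g")], ["id", "desc"])
new_rows.write.format("delta").mode("append").saveAsTable("db_schema.db_uc_table")

# Read it back with the DataFrame API.
spark.table("db_schema.db_uc_table").orderBy("id").show()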

Next, let’s create a new managed table in the same UC catalog and schema from the Databricks workspace and read it from the local terminal. Here’s what you’ll see in the spark-sql shell before the managed table is created:

dkushari_8-1730724640032.png

Now, create the managed table from the Databricks workspace:

dkushari_9-1730724640048.png

Then run the show tables command from the local terminal again. Now the managed table is in the list: 

dkushari_10-1730724640094.png

You can insert data into the managed table from the local terminal as well as the Databricks workspace. First, insert and select some data from the local terminal:

insert into db_uc_table_managed values (1,'a');
select * from db_uc_table_managed;

dkushari_11-1730724639936.png

dkushari_12-1730724639945.png

You can select the same data from the Databricks workspace:

dkushari_13-1730724640003.png

And here we are inserting data into the managed table from the Databricks workspace:

dkushari_14-1730724640053.png

Here’s another example of selecting data from the managed table from both the local terminal and the Databricks workspace:

dkushari_15-1730724640080.png

dkushari_16-1730724640073.png

Now let’s try performing some update and delete operations from the local terminal and the Databricks workspace. First, let’s delete some records from the external table from the local terminal:

delete from db_uc_table where id in (1,3);

Here we are selecting data from the external table using the local terminal and the Databricks workspace:

dkushari_17-1730724640023.png

dkushari_18-1730724640033.png

dkushari_19-1730724640034.png

Next, let's perform an update activity on the managed table from the Databricks workspace:

dkushari_20-1730724640054.png

Then select the data from the managed table using the local terminal. Notice that the change is reflected for the record with id = 3:

dkushari_21-1730724639974.png

We can display the history of the DML changes made to the managed table from the local terminal, and view it in the Databricks workspace UI as well:

desc history db_uc_table_managed;

dkushari_22-1730724640106.png

dkushari_23-1730724640079.png
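
From the pyspark shell, the Delta Lake Python API offers the same view of the table history. Here’s a small sketch, assuming the session from Step 2 and the delta-spark package from the launch command; whether DeltaTable.forName resolves tables through the UC connector is an assumption to verify.

from delta.tables import DeltaTable

# Table history via the Delta Lake Python API (equivalent to DESC HISTORY).
dt = DeltaTable.forName(spark, "db_schema.db_uc_table_managed")
dt.history().select("version", "timestamp", "operation").show(truncate=False)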

Step 4: Performing a UC access control test

Next, let’s do a quick test to verify that access control is working as we expect. To perform this test, we’re going to make a change to the UC permissions for the Databricks authenticated user who is accessing the spark-sql shell from the local terminal. Namely, we’ll remove their access to the UC object (i.e., the Delta table) we’ve been working with.

Let's change the owner of the external Delta table to a service principal instead of the current authenticated named user. This means the named user no longer has SELECT permission on the table.

dkushari_24-1730724640052.png

dkushari_25-1730724639872.png

This query from the Databricks workspace will now fail with insufficient permissions, as expected:

dkushari_26-1730724640011.png

The same query from the local terminal fails with insufficient permissions as well:

dkushari_27-1730724640054.png

This shows that the user who was authenticated using the Databricks PAT also requires proper UC authorization to access the data governed by Unity Catalog. Without the proper UC permission, access to the UC object will be denied.

Step 5: Accessing UniForm Iceberg tables in Databricks UC from the local terminal

You can query UniForm Iceberg tables from Databricks Unity Catalog through the Iceberg REST API. This allows you to access these tables from any client that supports Iceberg REST APIs without introducing new dependencies.
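
As an illustration of that claim, the sketch below configures Apache Iceberg’s REST catalog in PySpark directly against the Unity Catalog Iceberg REST endpoint. This is separate from the unitycatalog-spark connector approach used in the rest of this blog, and the endpoint path, the catalog name (ice), and the warehouse setting are assumptions to verify against the Databricks documentation.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("uc-iceberg-rest-test")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    # Register an Iceberg REST catalog named "ice" (the name is arbitrary).
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "rest")
    .config("spark.sql.catalog.ice.uri",
            "https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog/iceberg")
    .config("spark.sql.catalog.ice.token", "<<Your Databricks Workspace PAT>>")
    .config("spark.sql.catalog.ice.warehouse", "<<Your Default UC Catalog>>")
    .getOrCreate()
)

# List the namespaces (schemas) visible through the Iceberg REST catalog.
spark.sql("SHOW NAMESPACES IN ice").show()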

To enable accessing your UniForm Iceberg tables in Databricks UC from the local terminal, enter the following command to launch the spark-sql shell:

./spark-sql --name "s3-uc-test" \
	--master "local[*]" \
	--repositories "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/" \
	--packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0,org.apache.hadoop:hadoop-common:3.3.4,org.apache.hadoop:hadoop-aws:3.3.4" \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.token=<<Your Databricks Workspace PAT>>" \
	--conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Notice the inclusion of an additional package, org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1, for Iceberg. This is the only change from the previous configuration shown for launching the spark-sql shell.

Step 6: Performing read operations on Iceberg tables whose data is stored in cloud storage

Let’s run some queries against a UniForm Iceberg table registered in Databricks UC. First, we'll create a UniForm Iceberg table. We can do this from the local terminal or the Databricks workspace:

CREATE TABLE iceberg_tab (id INT, desc STRING)
LOCATION 's3://<<Your AWS S3 Location>>'
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg');

dkushari_28-1730724640080.png

We can now execute the following commands in the local terminal:

show schemas;
use db_schema;
show tables;
desc extended iceberg_tab;

Here’s what we’ll see:

dkushari_29-1730724640111.png

Next, let’s insert some data into the UniForm Iceberg table, via the Databricks workspace:

insert into iceberg_tab values (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');

dkushari_30-1730724639998.png

We can read from this table in the local terminal or the Databricks workspace using commands like the following:

select * from iceberg_tab order by id;

dkushari_31-1730724640051.png

dkushari_32-1730724640007.png

Finally, let’s update the UniForm Iceberg table and read from it again. Here’s the query we ran to update the table (as usual, you can run it from the local terminal or from the Databricks workspace, as shown here):

update iceberg_tab set desc = 'updated' where id = 3;

dkushari_33-1730724639966.png

And here’s what we see when we read from the table:

dkushari_34-1730724640013.png

dkushari_35-1730724639983.png

Conclusion

This blog showed you how to use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs. We also looked at how you can read your UniForm Iceberg tables registered in Databricks UC using the Iceberg REST API. Try out Unity Catalog’s open APIs today to access and process your data securely from any external engine using Apache Spark!