cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
dkushari
Databricks Employee
Databricks Employee

Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, providing open connectivity to any data source, any format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.

With its open APIs and introduction of credential vending, Databricks Unity Catalog data can be read by external engines and interfaces such as Iceberg REST APIs, DuckDB, Apache Spark™, Trino. 

In this blog, we explore how you can use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD (Create, Read, Update, and Delete) operations on your tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs. 

You can now use Spark SQL and DataFrame APIs to operate on Databricks Unity Catalog tables from an external processing engine, without having to configure your entire Spark application with one set of credentials to allow access to all your tables. Instead, the Spark integration will automatically acquire per-table credentials from UC (assuming the user has the necessary permissions) when running your Spark jobs. 

If you’d like to learn how you can set up your own Unity Catalog server and use Apache Spark™ from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Unity Catalog OSS metastore using UC’s open source REST APIs, please refer to this blog.

Securing Access Requests from External Engines

When Apache Spark requests access to data in a table registered in a Databricks UC metastore from an external processing engine, Unity Catalog issues short-lived credentials and URLs to control storage access based on the user’s specific IAM roles or managed identities, enabling data retrieval and query execution. The detailed steps are captured in the diagram below.

dkushari_0-1730724640060.png

Experiencing Apache Spark in Action with Unity Catalog’s Open APIs

In this section, we’ll look at how you can perform CRUD operations on tables registered in Databricks Unity Catalog using Spark SQL and PySpark DataFrame APIs. We’ll walk through the following steps:

  1. Setting up Apache Spark on the local workstation
  2. Accessing Databricks UC from the local terminal
  3. Performing CRUD operations on Delta tables whose data is stored in cloud storage
  4. Performing a UC access control test
  5. Accessing UniForm Iceberg tables in Databricks UC from the local terminal
  6. Performing read operations on Iceberg tables whose data is stored in cloud storage

Step 1: Setting up Apache Spark on the local workstation

The first step is to download and configure Apache Spark. You can download the latest version of Spark (>= 3.5.3) using a command like the following:

curl -O https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

Next, untar the package using the following command (for the rest of this tutorial, I’ll assume you’re using Spark 3.5.3):

tar xzf spark-3.5.3-bin-hadoop3.tgz

Step 2: Accessing Databricks UC from the local terminal

You can access Databricks UC from Apache Spark via the terminal using the Spark SQL shell or the PySpark shell.

Accessing Databricks UC from the Spark SQL shell

To use the Spark SQL shell (bin/spark-sql), go into the bin folder inside the downloaded Apache Spark folder (spark-3.5.3-bin-hadoop3) in your terminal:

cd spark-3.5.3-bin-hadoop3/bin

Once you’re inside the bin folder, run the following command to launch the spark-sql shell (see below for a discussion of the packages and configuration options): 

./spark-sql --name "s3-uc-test" \
	--master "local[*]" \
	--repositories "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/" \
	--packages "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0" \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.token=<<Your Databricks Workspace PAT>>" \
	--conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Note the following items in this command:

  • --packages points to the delta-spark and unitycatalog-spark packages.
  • spark.sql.defaultCatalog=<<Your Default UC Catalog>> must be filled out to indicate the default catalog you want to use when launching the spark-sql shell.
  • spark.sql.catalog.<<Your Default UC Catalog>>.uri points to the Databricks UC REST API endpoint for the workspace.
  • spark.sql.catalog.<<Your Default UC Catalog>>.token is your Databricks workspace personal access token (PAT), which is used to authenticate you as a legitimate user to the Databricks platform. Access to the UC objects is controlled via the UC permissions model.
  • spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem must be set to access your cloud object storage (in this example, AWS S3).

Now you’re ready to perform operations using Spark SQL in Databricks UC.

Accessing Databricks UC from the PySpark shell

To use the PySpark shell (bin/pyspark), go into the bin folder inside your downloaded Apache Spark folder (spark-3.5.3-bin-hadoop3) in your terminal:

cd spark-3.5.3-bin-hadoop3/bin

Once you’re inside the bin folder, run the following command to launch the pyspark shell (see the previous section for a discussion of the packages and configuration options):

./pyspark --name "s3-uc-test" \
	--master "local[*]" \
	--repositories "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/" \
	--packages "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0" \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.token=<<Your Databricks Workspace PAT>>" \
	--conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Now you’re ready to perform operations using PySpark in Databricks UC.

Step 3: Performing CRUD operations on Delta tables whose data is stored in cloud storage

In this step, I’ll walk you through performing some CRUD operations on Databricks UC tables. I’ll use Spark SQL here, but the same SQL commands can be run in the PySpark shell, embedding them inside spark.sql(). You can also use PySpark DataFrame APIs to perform DML (Data Manipulation Language) operations on the Databricks UC tables.

Here are some of the commands you can run inside the spark-sql shell, including example output for some of them:

  • See all the accessible UC catalogs:
SHOW CATALOGS;
  • See all the available schemas in the default (<<Your Default UC Catalog>>) catalog you used when launching the spark-sql shell: 
SHOW SCHEMAS;

dkushari_1-1730724640079.png

  • Create a new schema:
create schema db_schema;
  • Use this new schema as the default schema: 
use schema db_schema;
  • Create a new external table:
CREATE external TABLE
db_uc_table (id INT, desc STRING)
USING delta
LOCATION 's3://<<Your AWS S3 Location>>';
  • Describe the newly created table:
desc extended db_uc_table;

dkushari_2-1730724640083.png

For comparison, here’s how this information looks like in the Databricks workspace:

dkushari_3-1730724640061.png

  • Insert some records from the local terminal into the newly created table and select from that table:
insert into db_uc_table values (1,'a'),(2,'b'), (3,'c'),(4,'d'),(5,'e');
select * from db_uc_table order by id;


dkushari_4-1730724640055.png

Here’s what you’ll see if you explore the same table from the Databricks workspace:

dkushari_5-1730724640032.png

dkushari_6-1730724640082.png

dkushari_7-1730724640032.png

Next, let’s create a new managed table in the same UC catalog and schema from the Databricks workspace and read it from the local terminal. Here’s what you’ll see in the spark-sql shell before the managed table is created:

dkushari_8-1730724640032.png

Now, create the managed table from the Databricks workspace:

dkushari_9-1730724640048.png

Then run the show tables command from the local terminal again. Now the managed table is in the list: 

dkushari_10-1730724640094.png

You can insert data into the managed table from the local terminal as well as the Databricks workspace. First, insert and select some data from the local terminal:

insert into db_uc_table_managed values (1,'a');
select * from db_uc_table_managed;

dkushari_11-1730724639936.png

dkushari_12-1730724639945.png

You can select the same data from the Databricks workspace:

dkushari_13-1730724640003.png

And here we are inserting data into the managed table from the Databricks workspace:

dkushari_14-1730724640053.png

Here’s another example of selecting data from the managed table from both the local terminal and the Databricks workspace:

dkushari_15-1730724640080.png

dkushari_16-1730724640073.png

Now let’s try performing some upsert activities on the external table from the local terminal and the Databricks workspace. First, let's perform a delete activity on the external table from the local terminal: 

delete from db_uc_table where id in (1,3);

Here we are selecting data from the external table using the local terminal and the Databricks workspace:

dkushari_17-1730724640023.png

dkushari_18-1730724640033.png

dkushari_19-1730724640034.png

Next, let's perform an update activity on the managed table from the Databricks workspace:

dkushari_20-1730724640054.png

And select the data from the managed table using the local terminal. You can notice the change reflected for the record with id = 3:

dkushari_21-1730724639974.png

We can show the history of the changes in the managed table due to DML operations from the local terminal and view it in the Databricks workspace UI as well:

desc history db_uc_table_managed;

dkushari_22-1730724640106.png

dkushari_23-1730724640079.png

Step 4: Performing a UC access control test

Next, let’s do a quick test to verify that access control is working as we expect. To perform this test, we’re going to make a change to the UC permissions for the Databricks authenticated user who is accessing the spark-sql shell from the local terminal. Namely, we’ll remove their access to the UC object (i.e., the Delta table) we’ve been working with.

Let's change the owner of the external Delta table to a service principal instead of the current authenticated named user. This means the named user no longer has SELECT permission on the table.

dkushari_24-1730724640052.png

dkushari_25-1730724639872.png

This query from the Databricks workspace will now fail with insufficient permissions, as expected:

dkushari_26-1730724640011.png

The same query from the local terminal fails with insufficient permissions as well:

dkushari_27-1730724640054.png

This shows that the user who was authenticated using the Databricks PAT also requires proper UC authorization to access the data governed by Unity Catalog. Without the proper UC permission, access to the UC object will be denied.

Step 5: Accessing UniForm Iceberg tables in Databricks UC from the local terminal

You can query UniForm Iceberg tables from Databricks Unity Catalog through the Iceberg REST API. This allows you to access these tables from any client that supports Iceberg REST APIs without introducing new dependencies.

To enable accessing your UniForm Iceberg tables in Databricks UC from the local terminal, enter the following command to launch the spark-sql shell:

./spark-sql --name "s3-uc-test" \
	--master "local[*]" \
	--repositories "https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/" \
	--packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.0,org.apache.hadoop:hadoop-common:3.3.4,org.apache.hadoop:hadoop-aws:3.3.4" \
	--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
	--conf "spark.sql.catalog.spark_catalog=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>=io.unitycatalog.spark.UCSingleCatalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.uri=https://<<Your Databricks Workspace URL>>/api/2.1/unity-catalog" \
	--conf "spark.sql.catalog.<<Your Default UC Catalog>>.token=<<Your Databricks Workspace PAT>>" \
	--conf "spark.sql.defaultCatalog=<<Your Default UC Catalog>>"

Notice the inclusion of an additional package, org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1, for Iceberg. This is the only change from the previous configuration shown for launching the spark-sql shell.

Step 6: Performing read operations on Iceberg tables whose data is stored in cloud storage

Let’s run some queries against a UniForm Iceberg table registered in Databricks UC. First, we'll create a UniForm Iceberg table. We can do this from the local terminal or the Databricks workspace:

CREATE TABLE iceberg_tab(id INT, desc string)
location "s3://<<Your AWS S3 Location>>"
TBLPROPERTIES(
'delta.enableIcebergCompatV2' = 'true',
'delta.universalFormat.enabledFormats' = 'iceberg');

dkushari_28-1730724640080.png

We can now execute the following commands in the local terminal:

show schemas;
use db_schema;
show tables;
desc extended iceberg_tab;

Here’s what we’ll see:

dkushari_29-1730724640111.png

Next, let’s insert some data into the UniForm Iceberg table, via the Databricks workspace:

insert into iceberg_tab values (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');

dkushari_30-1730724639998.png

We can read from this table in the local terminal or the Databricks workspace using commands like the following:

select * from iceberg_tab order by id;

dkushari_31-1730724640051.png

dkushari_32-1730724640007.png

Finally, let’s update the UniForm Iceberg table and read from it again. Here’s the query we ran to update the table (as usual, you can enter this in the local terminal or the Databricks workspace, as shown here):

update iceberg_tab set desc = 'updated' where id = 3;

dkushari_33-1730724639966.png

And here’s what we see when we read from the table:

dkushari_34-1730724640013.png

dkushari_35-1730724639983.png

Conclusion

This blog showed you how to use Apache Spark from an external (non-Databricks) processing engine to securely perform CRUD operations on your Delta tables registered in a Databricks Unity Catalog metastore, using UC’s open source REST APIs. We also looked at how you can read your UniForm Iceberg tables registered in Databrick UC using the Iceberg REST API. Try out Unity Catalog’s open APIs today to access and process your data securely from any external engine using Apache Spark!

2 Comments
Klugscheißer
New Contributor

This looks exactly like what I've been hoping for! However, the tutorial as shown here doesn't seem to work for me.

  1. The repo https://s01.oss.sonatype.org/content/repositories/iounitycatalog-1016/ does not exist and 404s. Luckily, unitycatalog-spark is published elsewhere on sonatype
  2. There seems to be a version mismatch in the dependent hadoop libraries.
    • Spark 3.5.{0-3} are all compiled against hadoop 3.3.4.
    • unitycatalog-spark versions 0.2.{0,1} depend on spark 3.5.3.
    • unitycatalog-spark also depend on hadoop-client-runtime 3.4.0
    • hadoop libraries 3.4.0 have an updated FSBuilder that has an `optLong` method that is not present in 3.3.4. This results in `java.lang.NoSuchMethodError` if this tutorial is attempted with spark 3.3.4 and `java.lang.NumberFormatException` if upgraded to 3.4.0.

Is this tutorial possible or was unitycatalog-spark released with incompatible dependencies? Thanks!

 

Klugscheißer
New Contributor

I was able to resolve the above dependency issues by explicitly depending on hadoop-aws 3.4.1 in addition to `spark-hadoop-cloud` and was able to read and write to tables in UC from OSS spark!

However, I'm now having difficulty with the Uniform part. If I only add a dependency on `iceberg-spark-runtime-3.5`, then writing to a Uniform iceberg table fails with

ERROR org.apache.spark.sql.delta.DeltaLog - Failed to find Iceberg converter class
java.lang.ClassNotFoundException: org.apache.spark.sql.delta.icebergShaded.IcebergConverter

If I add `delta-iceberg` then the IcebergConverter class is found, reading and writing to the table works, however the iceberg conversion on write fails with

ERROR org.apache.spark.sql.delta.icebergShaded.IcebergConverter - Error when converting to Iceberg metadata
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: [SCHEMA_NOT_FOUND] The schema `test_schema` cannot be found. Verify the spelling and correctness of the schema and catalog.
[info] If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.

Possibly related, while `SHOW SCHEMAS` does list the `test_schema` I've been using and `SELECT current_schema()` does indeed return `test_schema`, `spark.sessionState.catalog.listDatabases()` does not! Reading the code, it seems like this is what's referenced during the iceberg conversion. It's not clear to me why the schema / db is not consistent across various spark calls.