This post is written by Pascal Vogel, Solutions Architect, and Kiryl Halozhyn, Senior Solutions Architect.
The Databricks Data Intelligence Platform allows your entire organization to use data and AI. It’s built on a lakehouse architecture to provide an open, unified foundation for all data and governance.
Data stored in Databricks can be securely accessed and shared across platforms, clouds and regions thanks to capabilities such as Delta Sharing, Unity Catalog’s open APIs, and Delta UniForm with unified governance provided by Unity Catalog.
This enables you to quickly and securely integrate Databricks with your existing cloud-based data environments, for instance for data engineering, SQL analytics, and machine learning.
Databricks can be deployed on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, and it integrates deeply with each of these cloud platforms so that you can use their native tools.
In the case of AWS, data processed and stored in Databricks can serve as the foundation for data exploration and feature engineering using Amazon SageMaker. SageMaker notebooks, fully managed notebooks in JupyterLab, can be used for exploring data and building ML models.
Several options are available to securely access data stored in Databricks from a SageMaker notebook while maintaining Unity Catalog governance with permissions enforcement, auditability, and observability.
This blog post introduces and compares four options for securely accessing data governed by Unity Catalog from Amazon SageMaker notebooks: Delta Sharing, Databricks Connect, the Databricks SQL Connector for Python, and the Unity Catalog open APIs.
The scenario assumes you have a Unity Catalog-enabled workspace and read access to a table that you want to access in a SageMaker notebook. In this blog post, the table marketing.campaigns.leads registered in Unity Catalog is used as an example.
The instructions in this blog are based on a SageMaker notebook instance running Amazon Linux 2 with JupyterLab 3 and the conda_python3 kernel. For simplicity, a Databricks personal access token (PAT) stored in an environment variable is used to securely authenticate with Databricks. Depending on your Databricks workspace setup, other authentication methods are available, including OAuth.
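For example, you can set the environment variable at the start of a notebook session without hardcoding the token, as in this minimal sketch:
import os
from getpass import getpass

# Prompt for the Databricks personal access token instead of hardcoding it in the
# notebook, and expose it to the rest of the session as DATABRICKS_TOKEN
os.environ["DATABRICKS_TOKEN"] = getpass("Databricks personal access token: ")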
The following sections provide setup instructions and discuss the benefits and limitations of each option.
Delta Sharing is an open protocol for secure data sharing, making it simple to share data with other organizations regardless of which computing platforms they use. You can use Delta Sharing to share live data across platforms, clouds and regions with strong security and governance.
While Delta Sharing is available as an open-source project for sharing tabular data, using it in Databricks adds the ability to share non-tabular, unstructured data (volumes), AI models, views, filtered data, and notebooks. Delta Sharing does not require running compute on the provider side, as it provides short-lived URLs directly to the data files in object storage while adhering to Unity Catalog governance.
You can use Delta Sharing to share data with recipients who also use Databricks (Databricks-to-Databricks sharing) or recipients outside of Databricks (open sharing) that use one of many Delta Sharing integrations.
As the first step, set up a share for the data assets you want to access from SageMaker, following the steps described in Create and manage shares for Delta Sharing. You can think of a share as a collection of data assets that you make available to a list of recipients.
After setting up the share, you can download a credential file via the generated activation URL.
Upload the credential file to your SageMaker notebook JupyterLab environment using the file browser.
In the notebook, install the Python connector for Delta Sharing using the delta-sharing Python package:
!pip install delta-sharing
You can now confirm which tables are available in the share:
import delta_sharing

# Path to the credential file uploaded to the notebook environment
profile_file = "config.share"

client = delta_sharing.SharingClient(profile_file)
client.list_all_tables()
To read data from the shared table:
table_url = f"{profile_file}#marketing_share.campaigns.leads"
data = delta_sharing.load_as_pandas(table_url, limit=10)
display(data)
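Because all computation happens locally on the notebook instance, it can be useful to inspect the sample before pulling the full table into memory. A short sketch continuing from the snippet above:
# Inspect column types and summary statistics of the sample
print(data.dtypes)
print(data.describe(include="all"))

# Load the complete table only if it comfortably fits into notebook memory
full_df = delta_sharing.load_as_pandas(table_url)
print(full_df.shape)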
Delta Sharing is a suitable option to access Databricks data governed by Unity Catalog if your focus is on reading only the shared data and you do not need to write back into Unity Catalog. Besides structured data, you can also use Delta Sharing to access unstructured data governed by Unity Catalog Volumes from SageMaker.
This option does not require running compute on Databricks, and recipients always see the most up-to-date state of the data without creating silos, which keeps the setup simple and cost-effective. However, the data must be stored in Delta format, and it should be small enough for the notebook instance to handle, since all computation runs locally on SageMaker. Advanced data security features such as row-level filtering and column masking are currently not supported for shared data assets, and neither are some recent Delta Lake features such as deletion vectors.
Additionally, modifying the list of shared assets requires share owner permissions in Unity Catalog, which are often granted only to privileged users and can slow down changes.
Databricks Connect allows you to connect IDEs such as Visual Studio Code, PyCharm, RStudio Desktop, IntelliJ IDEA, notebook servers, and other custom applications to Databricks clusters.
Because Databricks Connect also supports serverless compute, you do not need to set up and manage a cluster to connect from SageMaker; flexible compute is up and running in seconds.
In a SageMaker notebook, you can use the databricks-connect Python package to connect to a cluster and securely access data governed by Unity Catalog.
See also Databricks Connect for Python for detailed requirements and setup instructions.
If you do not have access to a Databricks cluster, follow these instructions to create a cluster with Databricks Runtime 15.4 LTS.
In your SageMaker notebook, install the required packages:
!pip uninstall pyspark -y
!pip install --upgrade "databricks-connect==15.3.1"
Next, get the hostname and cluster ID for your cluster.
On your SageMaker notebook instance, set the DATABRICKS_TOKEN environment variable to a valid Databricks PAT.
To read data from a table in Unity Catalog using your cluster:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

# Authentication uses the DATABRICKS_TOKEN environment variable set earlier
config = Config(
    host="https://<workspace_name>.cloud.databricks.com/",
    cluster_id="<cluster_id>",
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

df = spark.read.table("catalog.schema.table")
df.show()
If your Databricks workspace is enabled for serverless compute, you do not need to specify a cluster ID and can simply connect to a serverless compute resource:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

# Authentication uses the DATABRICKS_TOKEN environment variable set earlier
config = Config(
    host="https://<workspace_name>.cloud.databricks.com/",
    serverless_compute_id="auto",
)

spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

df = spark.read.table("catalog.schema.table")
df.show()
Note that you can further simplify the connection setup by setting additional environment variables or setting up a configuration profile.
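For example, here is a minimal sketch that relies only on environment variables, assuming DATABRICKS_HOST and DATABRICKS_TOKEN are already set on the notebook instance:
import os
from databricks.connect import DatabricksSession

# With DATABRICKS_SERVERLESS_COMPUTE_ID set to "auto", neither host nor cluster ID
# needs to appear in code; host and token are resolved from the environment
os.environ["DATABRICKS_SERVERLESS_COMPUTE_ID"] = "auto"

spark = DatabricksSession.builder.getOrCreate()
spark.sql("SELECT current_user()").show()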
With Databricks Connect, you get access to a full Spark environment from SageMaker, can utilize the compute capacity of your remote Databricks cluster, and are not limited by the compute resources of your SageMaker notebook instance.
Thanks to the serverless compute option, there is no effort for setting up or managing a cluster. With access to a full Spark environment, you can read as well as write data in Unity Catalog.
With this option, the underlying data format is not limited to Delta: you get access to the entire data estate that you are allowed to access, just as if you were working directly in Databricks. You also benefit from row-level security and column masking policies applied to the tables.
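Since you have a full Spark session available, you can also write results back to Unity Catalog. The following sketch reuses the spark session from the previous snippets; the grouping column and the target table name are illustrative, so point the write at a schema where you have the required privileges:
# Aggregate the example table and persist the result as a new Unity Catalog table
leads = spark.read.table("marketing.campaigns.leads")
summary = leads.groupBy("campaign_id").count()  # campaign_id is an assumed column
summary.write.mode("overwrite").saveAsTable("marketing.campaigns.leads_summary")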
The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Databricks clusters and Databricks SQL warehouses.
A SQL warehouse is a compute resource that lets you query and explore data on Databricks. Databricks recommends using a serverless SQL warehouse to benefit from instant and elastic compute and minimal management overhead.
If you do not have access to a Databricks SQL warehouse, follow these instructions to create one.
Next, get the connection details (hostname and http_path) for your warehouse.
In a SageMaker notebook, you can access the Databricks SQL Connector for Python by installing the databricks-sql-connector Python package:
!pip install "databricks-sql-connector==3.3.0"
On your SageMaker notebook instance, set the DATABRICKS_TOKEN environment variable to a valid Databricks PAT.
To read data from a Unity Catalog table using a SQL warehouse:
import os
from databricks import sql

# The PAT stored in DATABRICKS_TOKEN is passed explicitly to the connector
connection = sql.connect(
    server_hostname="<workspace name>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse ID>",
    access_token=os.getenv("DATABRICKS_TOKEN"),
)

cursor = connection.cursor()
cursor.execute("SELECT * FROM marketing.campaigns.leads")

for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()
Note that you can further simplify the connection setup by setting additional environment variables.
The Databricks SQL connector is a simple option for accessing Databricks data from a SageMaker notebook if you would like to express your data needs in SQL. You can execute data operations such as transforms or aggregations on the SQL warehouse and access the results in your notebook.
You are not limited by the compute resources of your SageMaker notebook instance but can rely on scalable compute in Databricks if needed. As you can execute any SQL query on the warehouse, you can read as well as write data using this approach. As with the previous option, all Unity Catalog features are supported and there are no limitations on the data format.
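For example, you can push an aggregation down to the SQL warehouse and pull only the small result set into pandas. The following is a minimal, self-contained sketch; the status column is an assumption about the example table:
import os
import pandas as pd
from databricks import sql

# Open a short-lived connection; the aggregation runs on the SQL warehouse and
# only the result set is transferred to the notebook
with sql.connect(
    server_hostname="<workspace name>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse ID>",
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT status, COUNT(*) AS lead_count "  # status is an assumed column
            "FROM marketing.campaigns.leads GROUP BY status"
        )
        rows = cursor.fetchall()

# Convert the result rows into a pandas DataFrame for local analysis
summary = pd.DataFrame([tuple(row) for row in rows], columns=["status", "lead_count"])
print(summary)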
Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. With its open APIs, data cataloged in Unity Catalog can be read by virtually any compute engine. A growing number of partners are developing Unity Catalog-compatible API clients that you can use to directly query data governed by Unity Catalog.
The following example uses the Daft query engine with its Delta Lake and Unity Catalog integration to read data governed by Unity Catalog in Databricks.
In your SageMaker notebook, install the necessary Python package:
!pip install "getdaft[unity,deltalake]"
To read data from a Unity Catalog table:
import os

import daft
from daft.unity_catalog import UnityCatalog

# Authenticate against the Unity Catalog open APIs with the PAT set earlier
unity = UnityCatalog(
    endpoint="https://<workspace name>.cloud.databricks.com/",
    token=os.getenv("DATABRICKS_TOKEN"),
)

unity_table = unity.load_table("marketing.campaigns.leads")

df = daft.read_delta_lake(unity_table)
df.show()
With the Unity Catalog open APIs, you can easily access Databricks data using a wide range of query engines. There is no need to set up, manage or keep running any Databricks clusters or SQL warehouses. Currently, in the case of managed tables, you are limited to reading data using the Unity Catalog open APIs and cannot write data using this approach. In the case of external tables, you can both read and write data using the Unity Catalog open APIs.
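For example, you can apply a filter in Daft and hand the result to pandas for local exploration in the notebook. This is a sketch continuing from the snippet above; the status column and its value are assumptions about the example table:
# Filter with Daft, then materialize the result as a pandas DataFrame
qualified = df.where(df["status"] == "qualified")  # assumed column and value
pdf = qualified.to_pandas()
print(pdf.head())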
Accessing data governed by Unity Catalog using Amazon SageMaker notebooks can be achieved through various methods, each with its own set of benefits and limitations:
| Option | Cost | Limitations | Simplicity | When to use? |
| --- | --- | --- | --- | --- |
| Delta Sharing | $ | 🔴 | 🟠 | Read-only use cases on small data, with setup performed by workspace admins. |
| Databricks Connect | $$ | 🟢 | 🟢 | Any use case where the compute power of Databricks is required. |
| Databricks SQL Connector | $$ | 🟢 | 🟢 | Similar to Databricks Connect, but for SQL-only users. |
| Unity Catalog open APIs | $ | 🟠* | 🟢 | Currently read-only use cases, with a simple setup and the ability to bring your engine of choice. |
* Write support for managed tables via Unity Catalog open APIs is on the roadmap
Selecting the right approach depends on your specific use case, data access requirements, and preferred tools. As Unity Catalog Open APIs allow you to connect directly to the metastore from any external engine, we recommend it as a starting point for small to medium size data assets. For larger or more complex data assets, consider Databricks Connect to benefit from the power and flexibility of Databricks compute.
Neither of these options is tied to a specific programming language or dependent on the supported features of the Delta Sharing protocol, so they provide the full spectrum of Unity Catalog Delta table features.
Whether using Delta Sharing for cross-platform data sharing, Databricks Connect for direct integration with Databricks clusters, the SQL connector for running SQL queries, or the Unity Catalog Open APIs for flexible data access, these options provide secure and governed ways to integrate Databricks data with SageMaker.