Databricks Community

lammic · 05-02-2024

Mount points are a convenient pattern to facilitate data access in Databricks, while hiding the complexity of accessing cloud storage directly. Unity Catalog improves the governance of file-based cloud storage with Volumes.

In this blog post, we use an example to discuss the steps involved in planning and migrating to Volumes.

Mount points link remote storage to local storage

The Databricks documentation explains mount points as enablers for users “to mount cloud object storage to the Databricks File System (DBFS) to simplify data access patterns for users that are unfamiliar with cloud concepts.” Now, imagine you need to access unstructured data from cloud storage managed by another team or another external party. How would access control work in practice?

Let’s take AWS and S3 storage as an example: Databricks clusters leverage IAM roles to access the different mount points. Therefore, any user on the Databricks workspace with access to the respective clusters can now access these mount points. In Azure, you create mount points via Microsoft Entra ID application service principals; once mounted, all users will have access to that storage account. When rotating the service principal secret, you must unmount and remount the storage.

Unity Catalog on the Databricks Data Intelligence Platform offers finer-grained governance than mount points. With mount points, access control is managed through compute configurations, typically overseen by central IT teams. This approach ties access to the compute resources rather than to the data itself, which weakens governance. By migrating to Unity Catalog Volumes, you can grant access directly on the data assets, enabling more decentralized governance.

Unity Catalog governs volumes

To close this gap, Unity Catalog introduces the concept of Volumes. Volumes extend the governance layer in the lakehouse from tabular data to any data on storage. This is a critical capability when, for example, building AI applications that source various data shapes and formats.

Migration process

Business fit and continuity need to be at the heart of every migration strategy, so good planning is crucial. Before Unity Catalog, mount points typically served two purposes:

To read tabular data from paths to the object storage. In Unity Catalog, this pattern is covered by external tables.
To read unstructured data in the object storage. In Unity Catalog, this pattern is covered by the concept of Volumes.

In the following sections, we will explore in more detail how to assess, plan, and execute our migration from mounts to Unity Catalog Volumes.

Tip!

Databricks Labs has released UCX, a companion tool for upgrading to Unity Catalog. UCX automates and simplifies migrations to Unity Catalog, but it does not cover volumes yet (see this feature request). Until that feature is available, follow the steps outlined in the “Execution” section below.

Assessment

Before you migrate, you need a high-level overview of what to migrate. The assessment phase consists of collating information about the existing mount points, including but not limited to: the underlying storage location, the impacted code assets (e.g. notebooks and queries), and where each mount point is being used.

You can get information about mount points via dbutils.fs.mounts() or by running the UCX assessment. Listing the assets requires “grepping” (searching) across the code base.

Example output below:

Mount point	Cloud storage location	Code assets impacted (notebooks, etc.)
/mnt/domainAimages	s3://one-blog-post-mount-location-one/images/	/Workspace/Repos/michele@databricks.com/image-analysis
/mnt/domainBdata	abfss://domainb@oneblogpost.dfs.core.windows.net/	/Workspace/Repos/michele@databricks.com/read-raw-blobs
…	…	…
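The "grepping" step can be sketched in a few lines of Python. This is a minimal sketch, not part of UCX: it assumes the notebooks have been exported as source files to a local folder, and the helper name `find_mount_usage` and the list of file suffixes are illustrative choices.

```python
from pathlib import Path


def find_mount_usage(mount_points, code_root, suffixes=(".py", ".sql", ".scala", ".r")):
    """Map each mount point to the sorted list of source files that mention it."""
    usage = {mount: [] for mount in mount_points}
    for path in sorted(Path(code_root).rglob("*")):
        # Only look at exported source files, skip folders and other assets.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for mount in mount_points:
            # Plain substring match: "/mnt/data" would also match "/mnt/data2".
            if mount in text:
                usage[mount].append(str(path))
    return usage


# In a Databricks notebook the mount list could come from dbutils, for example:
#   mounts = {m.mountPoint: m.source for m in dbutils.fs.mounts()}
# Here a hand-written dictionary with the values from the table stands in for it.
mounts = {
    "/mnt/domainAimages": "s3://one-blog-post-mount-location-one/images/",
    "/mnt/domainBdata": "abfss://domainb@oneblogpost.dfs.core.windows.net/",
}
```

Calling `find_mount_usage(mounts, "<folder with exported code>")` returns one list of files per mount point, which you can paste into the third column of the assessment table. A mount point with an empty list is a candidate for removal.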

Planning

For each mount point, you will need to decide:

Whether you still need it! There’s no need to migrate mount points that are no longer in use.
Whether the capability is better served and governed as an external table or as a volume in Unity Catalog.

The deciding factor is the content: is it structured or unstructured data? For unstructured data, mark the mount point as Volume (V); for structured data, mark it as External Table (ET).

Depending on the number of mount points, storage locations, and business considerations, you may execute the migration in one step or multiple phases. For each mount point, indicate which phase you plan to migrate it in.

Example:

Mount point	Target (ET / V / remove)	Phase
/mnt/domainAimages	V	1
/mnt/domainBdata	ET	2
…	…	…

Execution

To illustrate the steps involved in migrating from mounts to UC Volumes, consider the following scenario.

You have already built an LLM chatbot with RAG that leverages internal confidential documents stored in PDF format. These documents are available from an external location and are currently accessed as a DBFS mount point.

By migrating to UC Volumes, you gain finer-grained access control over these confidential documents, and data lineage lets you audit how the PDF content feeds the LLM chatbot.

In the current configuration, the PDF documents are available under the mount point

/mnt/michele/

which is pointing to the Amazon S3 location

s3://one-blog-post-mount-location/michele/

Tip!

Use a table to keep track of mount points and their new Volume equivalent. Keeping track of progress helps you programmatically test and update code sections later.

You should also treat this as an opportunity to simplify access. For example, if multiple mount points share the same root at the storage level, you can reuse the same storage credential for multiple Volumes.

1. Set up storage credentials

To migrate to UC Volumes, you first need to make sure you can access the data in the external cloud storage location. To do this, you create a storage credential.

The storage credential provides an authentication and authorization mechanism that depends on the cloud provider: on AWS you would use an IAM role, on Azure a managed identity, and on Google Cloud a service account.
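For the AWS case, the information a storage credential needs is small: a name and the IAM role to assume. The sketch below only builds the request body as we understand the Unity Catalog REST API to expect it; the credential name and the role ARN are hypothetical placeholders, and the helper `storage_credential_payload` is ours, not a Databricks function.

```python
import json


def storage_credential_payload(name, role_arn, comment=""):
    """Build the JSON body for an AWS IAM role based storage credential."""
    payload = {"name": name, "aws_iam_role": {"role_arn": role_arn}}
    if comment:
        payload["comment"] = comment
    return payload


payload = storage_credential_payload(
    name="michele_credential",  # hypothetical credential name
    role_arn="arn:aws:iam::123456789012:role/uc-blog-post-access",  # hypothetical ARN
    comment="Access to the PDF documents bucket",
)
print(json.dumps(payload, indent=2))
```

A body like this would be sent as a POST to `/api/2.1/unity-catalog/storage-credentials` on your workspace URL. The same two values can also be entered in Catalog Explorer if you prefer the UI.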

2. Set up external locations

Next, you need to create the relevant external locations.

Then, verify that you can access each external location.
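Both steps can be expressed as SQL statements. The sketch below builds them as strings so that they can be reviewed before they run; the location and credential names are hypothetical, and we assume the external location covers the bucket root of the example scenario.

```python
def create_external_location_sql(location_name, url, credential_name):
    """Return a CREATE EXTERNAL LOCATION statement for the given URL and credential."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{location_name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential_name}`)"
    )


def list_location_sql(url):
    """Return a LIST statement; it fails if the path is not readable."""
    return f"LIST '{url}'"


create_stmt = create_external_location_sql(
    "michele_location",                    # hypothetical location name
    "s3://one-blog-post-mount-location/",  # assumed: bucket root of the scenario
    "michele_credential",                  # hypothetical credential name
)
verify_stmt = list_location_sql("s3://one-blog-post-mount-location/michele/")

# In a notebook attached to Unity Catalog enabled compute:
#   spark.sql(create_stmt)
#   display(spark.sql(verify_stmt))
```

If the `LIST` statement returns the PDF files, the credential and the external location are wired up correctly.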

3. Create UC Volume

Once the connection has been established successfully, you can create the UC volume and then verify its creation.
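For the scenario in this post, the two statements could look like the sketch below. The catalog, schema, and volume names are taken from the volume path used later in this post (`/Volumes/michele/default/externalvolume/`), and the location is the S3 path behind the existing mount point; the helper functions themselves are only illustrative.

```python
def create_external_volume_sql(catalog, schema, volume, location):
    """Return a CREATE EXTERNAL VOLUME statement for an existing storage path."""
    return (
        f"CREATE EXTERNAL VOLUME IF NOT EXISTS `{catalog}`.`{schema}`.`{volume}` "
        f"LOCATION '{location}'"
    )


def describe_volume_sql(catalog, schema, volume):
    """Return a DESCRIBE VOLUME statement to check the volume's type and location."""
    return f"DESCRIBE VOLUME `{catalog}`.`{schema}`.`{volume}`"


create_stmt = create_external_volume_sql(
    "michele", "default", "externalvolume",
    "s3://one-blog-post-mount-location/michele/",  # same path the mount points to
)
describe_stmt = describe_volume_sql("michele", "default", "externalvolume")

# In a notebook attached to Unity Catalog enabled compute:
#   spark.sql(create_stmt)
#   display(spark.sql(describe_stmt))
```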

Note!

The UC volume location is a subfolder of the external location. You can reuse the same external location to create multiple volumes as subfolders. You can then manage permissions for each volume separately, even though they share the same parent location.
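Per-volume permissions are granted with the `READ VOLUME` and `WRITE VOLUME` privileges. The sketch below generates the statements for two groups; the group names are hypothetical.

```python
VOLUME_PRIVILEGES = {"READ VOLUME", "WRITE VOLUME"}


def grant_volume_sql(privilege, volume_name, principal):
    """Return a GRANT statement for one privilege on one volume."""
    if privilege not in VOLUME_PRIVILEGES:
        raise ValueError(f"unsupported volume privilege: {privilege}")
    return f"GRANT {privilege} ON VOLUME {volume_name} TO `{principal}`"


grants = [
    # Hypothetical groups: the chatbot only reads, the curators also upload.
    grant_volume_sql("READ VOLUME", "michele.default.externalvolume", "rag-app-readers"),
    grant_volume_sql("WRITE VOLUME", "michele.default.externalvolume", "document-curators"),
]

# In a notebook: for stmt in grants: spark.sql(stmt)
```

Principals also need `USE CATALOG` and `USE SCHEMA` on the parent catalog and schema before they can reach the volume.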

At this stage, you can access the same cloud storage location in S3 via mount point and UC (external volume). If you upload a file via UC volume, the file will be available via the mount point and vice versa. This allows you to migrate the mount points with zero downtime for your application.
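Before switching any code over, you may want to confirm that both access paths really show the same files. The sketch below compares two directory listings; it assumes that both the mount point and the volume are reachable through local file paths on your compute, which depends on the cluster's access mode.

```python
import os


def listing_diff(mount_dir, volume_dir):
    """Return the file names that appear under only one of the two directories."""
    mount_files = set(os.listdir(mount_dir))
    volume_files = set(os.listdir(volume_dir))
    return {
        "only_in_mount": sorted(mount_files - volume_files),
        "only_in_volume": sorted(volume_files - mount_files),
    }


# Possible usage on a cluster where both paths are visible as local files:
#   listing_diff("/dbfs/mnt/michele", "/Volumes/michele/default/externalvolume")
# Where they are not, the two name sets could instead be built with
#   {f.name for f in dbutils.fs.ls(path)}
```

Two empty lists mean the top-level contents match. The comparison is not recursive, so nested folders would need a separate pass.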

4. Adjust code and clean up

With your volume in place, you now need to update your code to use the volume instead of the mount point. If there are only a few code artifacts, it might be simpler and quicker to do this manually.

Example:

Reading the PDF file as Spark DataFrame with mount points:

pdf_df = spark.read.format('binaryFile').load('/mnt/michele/TenThings_Supplemental.pdf')

After the migration to UC volume:

pdf_df = spark.read.format('binaryFile').load('/Volumes/michele/default/externalvolume/TenThings_Supplemental.pdf')

However, if you have many notebooks and files in your workspace, consider scanning all of them and replacing the respective paths programmatically, at least until UCX implements the corresponding feature request.
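A minimal sketch of such a scan-and-replace is shown below. It assumes the code has been exported as source files and that the mapping table is available as a dictionary; the function names are illustrative, and the `dry_run` flag lets you review what would change before anything is written.

```python
import re
from pathlib import Path

# Mapping table: mount point -> volume path (values from the example above).
MOUNT_TO_VOLUME = {
    "/mnt/michele/": "/Volumes/michele/default/externalvolume/",
}


def rewrite_paths(text, mapping):
    """Replace mount paths, including /dbfs and dbfs: forms, with volume paths."""
    if not mapping:
        return text
    expanded = {
        prefix + mount: volume
        for mount, volume in mapping.items()
        for prefix in ("/dbfs", "dbfs:", "")
    }
    # Longest alternatives first, so the prefixed forms win over the bare mount.
    pattern = re.compile(
        "|".join(re.escape(key) for key in sorted(expanded, key=len, reverse=True))
    )
    return pattern.sub(lambda match: expanded[match.group(0)], text)


def migrate_file(path, mapping, dry_run=True):
    """Rewrite one file; return True if it contains paths that need migrating."""
    original = Path(path).read_text(encoding="utf-8")
    updated = rewrite_paths(original, mapping)
    changed = updated != original
    if changed and not dry_run:
        Path(path).write_text(updated, encoding="utf-8")
    return changed
```

Because the mapping keys end with a slash, a bare reference such as `/mnt/michele` without a trailing slash is not rewritten and still needs a manual check.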

Use the mapping table you created during the planning phase to track your migration progress as you eliminate the need for each mount point, clean up permissions, and build storage credentials. Depending on your cloud provider, you will also need to ensure that the IAM roles or managed identities have only the minimum privileges they need on the storage.

It’s always good practice to use parameters in your code; it’s even more critical for UC volume paths.

Important!

Mount points have workspace-level scope, while UC volumes have metastore-level scope. For example, if your development, test, and production workspaces are bound to the same Unity Catalog metastore, their respective volumes must have different fully qualified names (catalog.schema.volume), for instance by using a separate catalog per environment.
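One way to combine parameterised paths with metastore-wide naming is to derive the catalog from an environment parameter. The sketch below assumes one catalog per environment; the `dev` and `test` catalog names are hypothetical, and only the `prod` entry matches the example used in this post.

```python
# Hypothetical mapping from environment to catalog in a shared metastore.
CATALOG_BY_ENV = {
    "dev": "michele_dev",
    "test": "michele_test",
    "prod": "michele",
}


def volume_path(env, schema, volume, *parts):
    """Build a /Volumes/... path for the given environment."""
    catalog = CATALOG_BY_ENV[env]
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


# In a notebook the environment could come from a widget or job parameter:
#   env = dbutils.widgets.get("env")
pdf_path = volume_path("prod", "default", "externalvolume", "TenThings_Supplemental.pdf")
print(pdf_path)
```

The same notebook can then be promoted across workspaces by changing only the `env` parameter.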

Conclusion

We just walked through the basic steps for migrating from mount points to volumes in Unity Catalog. As opposed to the passthrough nature of mount points, volumes (like external tables) ensure fine-grained access control and provide enhanced data governance capabilities, including lineage.

You can also decentralize access control by moving it from a central team (in charge of infrastructure management) to the business teams (the actual data owners). By removing friction, you can reduce the time to market for your projects.

As you plan your migration, put your platform users’ experience at the center of everything and ensure business continuity. Then, automate as much of the migration as possible, leveraging the UCX framework where it applies.
