multiple storage credentials/external locations to same physical location

-werners-
Esteemed Contributor III

Hi all,

We are in the process of rolling out a new Unity Catalog-enabled Databricks environment with two tiers: dev and prod.

Initially we had the plan to completely decouple dev and prod, each with their own data lake as storage.
While this is the safest option, it does give me some headaches on how to get good data into that dev data lake.
So I started thinking: is it possible to define TWO storage credentials (with different Azure Access Connectors!) on the same (prod) data lake, mark one of them as read-only, and use that one for our dev environment: create a read-only external location, catalog, etc.?
For the prod environment we would do the same, but write-enabled.
That way we can read prod data without any risk of writing to the prod data lake.

Is this possible?  I know Unity Catalog is quite strict about overlapping paths, but perhaps it works with different Access Connectors/storage credentials?

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @-werners-, in Azure Databricks you can achieve your goal using external locations and storage credentials.

 

Let me break down how this can work:

 

External Locations:

  • An external location pairs a cloud storage path (such as your data lake) with a storage credential that authorizes access to that path.
  • External locations are used for creating, reading from, and writing to external tables and volumes.
  • They can also provide managed storage for managed tables and volumes: you can assign a managed location at the catalog or schema level, overriding the metastore root storage location (a minimal DDL sketch follows this list).
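To make this concrete, here is a minimal sketch of the DDL as it could be run from a Databricks notebook on a Unity Catalog-enabled cluster. The credential name, location name, catalog name, and abfss:// URL are placeholders, not values from this thread:

```python
# Minimal sketch (Databricks notebook cell); all names and the URL are placeholders.

# An external location = a cloud storage path + a storage credential that authorizes access to it.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS prod_lake
  URL 'abfss://data@prodlake.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL prod_lake_rw)
  COMMENT 'Write-enabled location over the prod data lake'
""")

# Managed tables/volumes of this catalog are stored under the given path,
# overriding the metastore root storage location.
spark.sql("""
  CREATE CATALOG IF NOT EXISTS prod
  MANAGED LOCATION 'abfss://data@prodlake.dfs.core.windows.net/managed/prod'
""")
```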

Storage Credentials:

  • A storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant.
  • You can create storage credentials using an Azure managed identity (strongly recommended) or a service principal.
  • Each storage credential is subject to Unity Catalog access-control policies that control which users and groups can access it.
  • You can mark a storage credential as read-only to prevent users from writing to external locations that use that credential (a sketch follows this list).
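As a rough sketch of how such a read-only credential could be created programmatically with the Databricks SDK for Python: the Access Connector resource ID and all names below are placeholders, and the exact request class names can differ between SDK versions, so verify against the version you have installed.

```python
# Sketch only: create a READ-ONLY storage credential backed by the dev Access Connector's
# managed identity. Resource IDs and names are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import AzureManagedIdentityRequest

w = WorkspaceClient()

w.storage_credentials.create(
    name="prod_lake_ro",
    azure_managed_identity=AzureManagedIdentityRequest(
        access_connector_id=(
            "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
            "Microsoft.Databricks/accessConnectors/<dev-connector>"
        )
    ),
    read_only=True,  # principals using this credential can never write through it
    comment="Read-only access to the prod data lake for the dev workspace",
)
```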

Implementation for Dev and Prod:

  • For your dev environment, create an external location with a read-only storage credential pointing to the same data lake.
  • This lets you read data from the prod data lake without risking writes.
  • In your prod environment, create another external location with a write-enabled storage credential for normal operations.
  • By segregating the credentials, you ensure that dev operations don’t accidentally modify prod data (a sketch of the setup and grants follows below).
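A sketch of what the dev-side external location and the corresponding grants could look like. All location, credential, and group names and the URL are placeholders; this only illustrates the proposed setup and is not a confirmation that the metastore will accept a second external location on an exactly overlapping path.

```python
# Sketch only: a read-only external location over the SAME prod path, used from the dev side.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS prod_lake_readonly
  URL 'abfss://data@prodlake.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL prod_lake_ro)
""")

# Dev principals only ever get READ FILES on the read-only location...
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION prod_lake_readonly TO `dev-engineers`")

# ...while write privileges on the prod location stay with prod service principals.
spark.sql("""
  GRANT CREATE EXTERNAL TABLE, WRITE FILES
  ON EXTERNAL LOCATION prod_lake TO `prod-jobs-sp`
""")
```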

Unity’s Strictness Concerning Overlap:

  • While Unity Catalog is strict about overlapping paths, using different storage credentials for read-only and write-enabled access should work.
  • The key is to ensure that the external locations and storage credentials are well-defined and separate.

Remember to follow best practices, such as using managed identities and adhering to access-control policies, to maintain security and governance. Feel free to ask if you have any specific requirements or need further assistance! 🚀🔐

 

For more detailed information, refer to the official Azure Databricks documentation on managing external locations and storage credentials.


8 REPLIES


-werners-
Esteemed Contributor III

That is awesome.  I was not sure it would work like that but apparently it does.

Tnx!

Kaniz
Community Manager

Thank you @-werners- ! 
I'm glad it helped.

-werners-
Esteemed Contributor III

Unfortunately it does not seem to work.  When creating a second external location on the same path, I get the dreaded error saying an external location already exists on that path, which is exactly what I am trying to do 😞

karthik_p
Esteemed Contributor

Were you able to resolve this? We are in a similar situation: we want to create two catalogs with different permissions, but we get an overlap error.

So we thought we would go with the above approach. How should this be handled for DLT with Unity Catalog? We created two catalogs, one using the default metastore storage and the other using an altogether new storage account and container, with catalog-level segregation via a managed location. When we execute, we get an overlap error on the managed storage for the second catalog. @-werners- @Kaniz 

-werners-
Esteemed Contributor III

I could apply 'force create', but it does seem risky to do so.

Wojciech_BUK
Contributor III

I think you are overcomplicating it a bit.

You can have two workspaces, prod and dev, and two catalogs, prod and dev.

You can make the prod catalog read-only in the dev environment (via workspace-catalog binding), or you can do shallow clones from prod to dev with some scripts (see the sketch below).
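For the shallow-clone route, a minimal sketch of the refresh script; the three-level catalog.schema.table names are placeholders I made up:

```python
# Sketch: refresh a dev copy of a prod Delta table as a shallow clone.
# A shallow clone copies only metadata and references the prod data files,
# so it is cheap to (re)create. Names are placeholders.
spark.sql("""
  CREATE OR REPLACE TABLE dev.sales.orders
  SHALLOW CLONE prod.sales.orders
""")
```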

Additionally, you have ACLs over external locations and catalogs. Say you have engineers and analysts: you grant engineers the right to create objects in the dev catalog and write tables to the dev external location, and you grant analysts read access on prod (a grant sketch follows below).
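A rough sketch of that ACL split; the group names `engineers` and `analysts`, the catalog names, and the external location name `dev_lake` are all placeholders:

```python
# Sketch of the ACL split; group, catalog, and location names are placeholders.

# Engineers: may create objects in the dev catalog and write to the dev external location.
spark.sql("""
  GRANT USE CATALOG, USE SCHEMA, CREATE SCHEMA, CREATE TABLE
  ON CATALOG dev TO `engineers`
""")
spark.sql("GRANT CREATE EXTERNAL TABLE, WRITE FILES ON EXTERNAL LOCATION dev_lake TO `engineers`")

# Analysts: read-only on prod.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `analysts`")
```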

If you screw up the ACLs in Unity, the backend setup does not matter.

I allow write operations on prod only to service principals (via jobs).

I have used the above setup in a few Unity-enabled projects with no surprises so far.

-werners-
Esteemed Contributor III

That is the way I am working right now: assign the workspace to the catalog and set it to read-only where necessary.
It would be easier though if it were possible to define a second, read-only external location on the same path, as that cannot break anything (being read-only).
