How to use external locations

pernilak — Mon, 18 Mar 2024 14:55:48 GMT

Hi,

I am struggling with truly understanding how to work with external locations. As far as I am able to read, you have:

1) Managed catalogs
2) Managed schemas
3) Managed tables/volumes etc.
4) External locations that contains external tables and/or volumes
5) External volumes that can reside inside managed catalogs/schemas

Most of the time, we want to write data inside of databricks - so managed catalogs, schemas and tables/volums seems natural. However, there are times when we want to write data (that we need to access inside of databricks) outside of databricks. In those cases, I understand that the way to do so, is using external locations.

However, working with external locations afterwards, I don't find straight forward.

For volumes, I like how I can create an external volume inside of a catalog. Then I have my raw catalog, with domain schmas and belonging managed tables end external volumes are organized within. However, when working with tabular data I find it harder to understand what you are supposed to do with it.

Databricks says: "Don't grant general READ FILES [...] permission on external locations to end users". Then how exactly should my users (I am a platform engineer, my users are data engineers, scientists and analysts) access these files? I don't want to do the work of creating managed tables for every table in an external location - when new data appears, those tables must be refreshed with new data. We have a lot of streaming use cases as well. ideally, I want tables to be organized in my catalogs and schemas the same way you can do with external volumes.

topic How to use external locations in Get Started Discussions

How to use external locations