Unity Catalog with unstructured "lake" data? (.png, .gzip, .zarr)

YSF
New Contributor III

I have images, zipped files, and arbitrary binary files in ADLS gen 2.

In Unity Catalog I created a storage credential and an external location to the container that contains these files in directories.
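For reference, the setup looks roughly like this (the storage credential itself was created through the UI; the location name, credential name, and grantee below are placeholders, not my real ones):

```

# Sketch of my setup (placeholders only): register the container root as an
# external location backed by the storage credential, then grant myself access.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS my_lake_location
  URL 'abfss://<container>@<storage_account>.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_storage_credential)
""")

spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_lake_location TO `me@example.com`")

```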

Is there any way I can view them within Unity Catalog without coercing them into a parquet file?

Also, once the external location is created, I was hoping I'd be able to simply read the files like:

```

with open('abfss://path/set/in/external/location/data.zarr') as f:
    data = f.read()

```

I tried following this: https://aboutdataai.com.au/tag/unity-catalog/ but I get a 403 when I try to dbutils.fs.ls() the contents of the directory. I'm not sure why, because I have granted myself access to that external location.

I was also not able to directly open and read a text file in that same location; I got an error that the file doesn't exist.

Any help on this would be appreciated.

For those who want more detail: I am trying to work with unstructured data in a data lake using Databricks. I'm told the 'mount' methods are discouraged because they will be deprecated soon, so I'm trying to see how this new world of Unity Catalog behaves with unstructured data. Unfortunately, so far it certainly feels like the new changes have turned Delta Lake into Delta Warehouse.

4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

I tested this using AWS S3 and it is working fine for me. Please check your code again.

YSF
New Contributor III

In the Data Explorer, in my external location, the URL is:

abfss://<container>@<storage_account>.dfs.core.windows.net

When I click "Test Connection" it says all permissions confirmed and has full privileges.

In my ADLS I have a text file inside a folder in the container, so the URI should look like this:

abfss://<container>@<storage_account>.dfs.core.windows.net/my_folder/my_file.txt

When I run this:

dbutils.fs.ls("abfss://<container>@<storage_account>.dfs.core.windows.net/")

I get:

ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url ___ overlaps with managed storage

When I run this:

dbutils.fs.ls("abfss://<container>@<storage_account>.dfs.core.windows.net/my_folder/")

I get a 403 request is not authorized.

I am using a single-node cluster with single-user access mode on DBR 11.3 LTS, and my user has full grants on both the storage credential and the external location in Unity Catalog.

I hope this extra information helps. I'm not clear on what's going wrong.

chiayui
New Contributor II

Hi, it is a known bug that External Locations do not work for Azure Storage Account root locations. Please create a folder in the root location and use that as the external location.
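For example, something along these lines should work (the location name, credential name, and path below are just placeholders):

```

# Hypothetical example: point the External Location at a folder rather than the container root.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS lake_folder_location
  URL 'abfss://<container>@<storage_account>.dfs.core.windows.net/my_folder'
  WITH (STORAGE CREDENTIAL my_storage_credential)
""")

# Listing that folder should then succeed.
display(dbutils.fs.ls("abfss://<container>@<storage_account>.dfs.core.windows.net/my_folder/"))

```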

With the External Location and its permissions properly set up, you can read its contents into Spark DataFrames using the URI ('abfss://...'). The Databricks Runtime supports the binary file data source.
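For example, something like this should load your images and other binary files (the path and the glob filter are just placeholders):

```

# Read arbitrary binary files into a DataFrame with path, modificationTime,
# length, and content (raw bytes) columns.
df = (spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.png")  # optional: restrict to one extension
      .load("abfss://<container>@<storage_account>.dfs.core.windows.net/my_folder/"))
display(df)

```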

Reading the file using the lines of code you provided (with open('abfss://...') as f: ...) is not supported yet.
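As a workaround, you could copy the file to the driver's local disk first and then open it with plain Python, roughly like this (paths are placeholders):

```

# Copy the file from the external location to local disk on the driver,
# then open it with the standard Python file API.
dbutils.fs.cp(
    "abfss://<container>@<storage_account>.dfs.core.windows.net/my_folder/my_file.txt",
    "file:/tmp/my_file.txt",
)

with open("/tmp/my_file.txt") as f:
    data = f.read()

```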

Aviral-Bhardwaj
Esteemed Contributor III

@Chia-Yui Lee Thanks for this information, I was not aware of this.
