Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Where are materialized views generated by Delta Live Tables stored?

ruoyuqian
New Contributor II

I am trying to compare tables created by dbt in the Catalog with the materialized views generated by Delta Live Tables, and I noticed that the dbt-generated table has Storage Location information pointing to a physical storage location, whereas the materialized view does not show a location.
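
For context, this is roughly how I'm checking the metadata (the three-part names below are placeholders):

  -- The dbt-created table shows a Location row pointing to physical storage
  DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_dbt_table;

  -- The same command on the DLT materialized view shows no storage location
  DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_materialized_view;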

6 REPLIES

Shazaamzaa
New Contributor III

It's an interesting question. Based on the limitation below from the official docs, I'm assuming the storage location is purposely not displayed in the metadata. I would be interested to know the answer, though.

Ref: Use materialized views in Databricks SQL | Databricks on AWS

  • The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views. Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers. For example, suppose the definition of a materialized view includes a COUNT(DISTINCT field_a) clause. Even though the materialized view definition only includes the aggregate COUNT DISTINCT clause, the underlying files will contain a list of the actual values of field_a.
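
To illustrate the doc's example, a materialized view defined like this (names made up):

  CREATE MATERIALIZED VIEW my_catalog.my_schema.field_a_stats AS
  SELECT COUNT(DISTINCT field_a) AS distinct_field_a
  FROM my_catalog.my_schema.upstream_table;

only exposes the aggregate in its schema, but per the quoted docs its underlying files would still contain the individual field_a values to support incremental refresh.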

TamD
Contributor

I would also like to know the answer to this. When I drop a materialized view from Unity Catalog, I want to be sure the storage is also cleared so I'm not getting billed for it.

radothede
Contributor II

I'm not sure about that, but I would check the managed storage locations in this order (see the SQL sketch at the end of this post):

1) schema managed storage location,

2) catalog managed storage location,

3) metastore managed storage location.

With reference to the managed storage docs:

The managed storage location that stores data and metadata for managed tables and managed volumes uses the following rules:

  • If the containing schema has a managed location, the data is stored in the schema managed location.

  • If the containing schema does not have a managed location but the catalog has a managed location, the data is stored in the catalog managed location.

  • If neither the containing schema nor the containing catalog have a managed location, data is stored in the metastore managed location.
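
A quick way to check the first two levels in SQL (the catalog and schema names are placeholders):

  -- 1) Schema-level managed location, if one is set
  DESCRIBE SCHEMA EXTENDED my_catalog.my_schema;

  -- 2) Catalog-level managed location, if one is set
  DESCRIBE CATALOG EXTENDED my_catalog;

The metastore-level storage root isn't exposed through a DESCRIBE command as far as I know; it's visible in the account console.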

MartinIsti
New Contributor III

To the best of my knowledge, DLT is a mini ETL system of its own. It took me quite a bit of time to get used to it, as it is similar to, yet very different from, a solution built from jobs and notebooks.

Since it has ownership of the target "tables" (materialized views), it manages them itself; they sit one step further inside Databricks' control, not in the hands of the developer.

I'm sure the objects are created in managed storage, but I don't know whether you can easily find them, or whether they are meant to be found.

Once you delete the pipeline, I assume the files are deleted either straight away or after 7 days.

A somewhat convoluted way to find them in the storage layer is to use an empty Databricks environment where the first thing you create is a materialized view; that should narrow down the set of files you need to look through.

TamD
Contributor

It would be good if Databricks would confirm the behaviour, rather than requiring us to make assumptions like this.

MartinIsti
New Contributor III

Couldn't agree more, TamD. My impression is that DLT (which has strong ties to LakeFlow Connect) is under heavy development right now, and the documentation is not necessarily keeping up with the actual code development. It should keep up, and if development pace is indeed the reason, it's not a good enough excuse, but it is a realistic one.

I would think that with new features, many current behaviours can easily change, and this storage behaviour might be one of them, especially when you consider DLT's Direct Publishing Mode (DPM), where you can dynamically specify target catalogs and schemas.
