Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Should data in Raw/Bronze be in Catalog?

cdn_yyz_yul
New Contributor II

Hello,

What are the benefits of not "registering" Raw data into Unity Catalog when the data in Raw will be in its original format, such as .csv, .json, .parquet, etc?

An example scenario could be:

  • Data arrives at Landing as .zip; 
  • The zip will be verified for correctness, and saved to Raw as-is, in a pre-defined folder structure. 
  • Unity Catalog will not know these files.
  • The next layer (Silver) will be in Catalog

I thought that the data in Raw should be in Catalog for various benefits (as stated in the documentation). But what would be the benefits of not adding them to Catalog? 

Thank you,

N.Z.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

Rjdudley
Valued Contributor II

@cdn_yyz_yul wrote:

But I received a proposal that suggested not "cataloging" Raw and instead using another tool to meet the need of searching files in Raw.
I would like to hear from the community whether there are benefits in doing so.


This is where "it depends" on what your company's setup is, but maybe I can provide some food for thought.  Do you already have this other tool, or is it a new purchase?  Did this proposal come from a tool vendor or a consultant, or from your CTO?  Do you have an ongoing need to search the raw files which would require an additional tool?  What business capabilities is this tool going to fulfill--just cataloging and searching, or is it a governance tool like Atlan/Alation/Collibra?

In most use cases you would not need an additional tool to search raw files.  You'd either transform the data, or create table metadata in Unity Catalog from the files and work with the files directly.  The advantage of Unity Catalog is that you get the same security settings and data classifications everywhere, a familiar UI, and only one thing to administer.
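As a rough illustration of "create table metadata in Unity Catalog from the files", here is a minimal sketch that builds a CREATE TABLE ... USING ... LOCATION statement for files left in their original format. The helper name, the table name `main.bronze.orders_raw`, and the bucket path are all made up for the example. It also assumes the path is covered by an external location you have privileges on, so treat it as a starting point rather than a recipe.

```python
from typing import Mapping, Optional


def build_external_table_ddl(
    table_name: str,
    file_format: str,
    location: str,
    options: Optional[Mapping[str, str]] = None,
) -> str:
    """Build DDL that registers existing files as an external table (hypothetical helper)."""
    lines = [
        f"CREATE TABLE IF NOT EXISTS {table_name}",
        f"USING {file_format.upper()}",
    ]
    if options:
        # Reader options such as a CSV header flag are passed through as-is.
        rendered = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
        lines.append(f"OPTIONS ({rendered})")
    # The files stay where they are; only metadata is created in the catalog.
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


# Hypothetical names and path, for illustration only.
ddl = build_external_table_ddl(
    "main.bronze.orders_raw",
    "csv",
    "s3://example-bucket/raw/vendor_a/orders/",
    {"header": "true"},
)
print(ddl)
```

In a Databricks notebook the resulting string could be run with `spark.sql(ddl)`, after which the files can be queried and permissioned like any other table.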

We're in the second year of our Databricks implementation.  My approach has been to wait and see whether an actual need arises, and whether Databricks comes out with a feature that solves it.  We saw some really nice tools at Data+AI Summit.  Shiny new things are easy to get excited about, but I always assess the actual business need and any lack of features before I expand.

If it's not abundantly clear why you need this extra tooling, ask back "what business capabilities are we realizing, or what features are we lacking?"  If the tool's features overlap with features in Databricks, keep in mind that my experience with Databricks has been positive largely because of the integration and simple management.  Hope that helps.


3 REPLIES

Rjdudley
Valued Contributor II

No, leave the raw out of the catalog.  I'd recommend doing something a little different.

Consider the zip files as "landed", not "raw".  Consider the raw data to be the unzipped data.

In your bronze-layer schema, configure an external location for the raw data.  In that same schema, create tables (DLT or just regular Delta tables, depending on your needs) and load the raw data from the external location into the bronze tables.  Make no modifications; just flatten the data.  This is your working bronze layer.  It takes a little extra work up front, but tables will give you much better performance than working directly with the raw JSON/CSV/XML/whatever, and they also give you permission-governed access to the raw data.  Then do your bronze>>silver and silver>>gold as usual.
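One way the load step could look, as a minimal sketch rather than the definitive approach: the helper below builds a Databricks SQL COPY INTO statement that loads raw files into an existing bronze Delta table. The helper name, the table name `main.bronze.orders`, and the `/Volumes/...` path are assumptions for illustration. It also assumes the raw folder is exposed through an external volume and that the target table was created beforehand. Auto Loader or DLT would be equally reasonable choices.

```python
from typing import Mapping, Optional

# File formats that COPY INTO accepts as a source.
SUPPORTED_FORMATS = {"CSV", "JSON", "AVRO", "ORC", "PARQUET", "TEXT", "BINARYFILE"}


def build_copy_into(
    target_table: str,
    source_path: str,
    file_format: str,
    format_options: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a COPY INTO statement for loading raw files into a bronze table (hypothetical helper)."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file format: {file_format}")

    lines = [
        f"COPY INTO {target_table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {fmt}",
    ]
    if format_options:
        rendered = ", ".join(f"'{key}' = '{value}'" for key, value in format_options.items())
        lines.append(f"FORMAT_OPTIONS ({rendered})")
    # Let the bronze table pick up new columns instead of failing the load.
    lines.append("COPY_OPTIONS ('mergeSchema' = 'true')")
    return "\n".join(lines)


# Hypothetical table and path, for illustration only.
statement = build_copy_into(
    "main.bronze.orders",
    "/Volumes/main/bronze/raw_files/orders/",
    "csv",
    {"header": "true"},
)
print(statement)
```

In a Databricks notebook you would pass the result to `spark.sql(statement)`. COPY INTO skips files it has already loaded, so re-running it after new files arrive in the raw folder should be safe.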

cdn_yyz_yul
New Contributor II

Thanks @Rjdudley 

I meant to say, the scenario is:

  • Data arrives at Landing as .zip;   
  • The zip will be verified for correctness and then unzipped; the extracted files will be saved to Raw as-is, in a pre-defined folder structure. 
  • Unity Catalog will not know these files.
  • The next layer (Silver) will be in Catalog
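
For concreteness, the verify-and-unzip step in this scenario might look roughly like the sketch below. It is only an assumption of how "verified for correctness" could be done (a CRC check with the standard `zipfile` module), and the function name and the `source/load_date` folder layout are made up for the example.

```python
import zipfile
from pathlib import Path


def verify_and_extract(zip_path: Path, raw_root: Path, source: str, load_date: str) -> list[Path]:
    """Verify a landed zip, then extract it as-is under raw_root/source/load_date/."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")

    target_dir = raw_root / source / load_date
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() reads every member and returns the first one with a bad CRC, or None.
        corrupt_member = archive.testzip()
        if corrupt_member is not None:
            raise ValueError(f"corrupt member in {zip_path}: {corrupt_member}")

        target_dir.mkdir(parents=True, exist_ok=True)
        archive.extractall(target_dir)
        # Report only regular files, not directory entries.
        return [target_dir / info.filename for info in archive.infolist() if not info.is_dir()]
```

A call such as `verify_and_extract(Path("landing/batch.zip"), Path("raw"), "vendor_a", "2024-01-01")` would place the extracted files under `raw/vendor_a/2024-01-01/`, unchanged.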

I agree with your reply myself. But I received a proposal that suggested not "cataloging" Raw and instead using another tool to meet the need of searching files in Raw.
I would like to hear from the community whether there are benefits in doing so.

