Monday
Hello,
What are the benefits of not "registering" Raw data into Unity Catalog when the data in Raw will be in its original format, such as .csv, .json, .parquet, etc?
An example scenario could be:
I thought that the data in Raw should be in Catalog for various benefits (as stated in the documentation). But what would be the benefits of not adding them to Catalog?
Thank you,
N.Z.
Monday
No, leave the raw out of the catalog. I'd recommend doing something a little different.
Consider the zip files as "landed", not "raw". Consider the raw data to be the unzipped data.
In your schema in the bronze layer, configure an external location for the raw data. In that same schema, make tables (DLT or just regular Delta tables, depending on your needs) and load the raw data from the external location into the bronze tables. No modification, just flatten the data. This is your working bronze layer. This will take a little extra work up front, but tables will give you much better performance than working directly with the raw json/csv/xml/whatever, and will also give you permission-based access governance over the raw data. Then do your bronze>>silver and silver>>gold as usual.
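As a rough sketch of the steps above, the snippet below builds the Databricks SQL statements for an external location, an empty bronze Delta table, and a COPY INTO load from that location. Every name in it (`raw_landing`, `raw_cred`, `main.bronze.events`, the `abfss://` URL) is a hypothetical placeholder, and the helper functions are illustrative, not part of any Databricks API. On a cluster you would pass each string to `spark.sql(...)`. Flattening nested data is not shown here.

```python
from typing import List


def create_external_location_sql(name: str, url: str, credential: str) -> str:
    """Register a cloud storage path as a Unity Catalog external location."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


def create_bronze_table_sql(table: str) -> str:
    """Create an empty Delta table; its schema is filled in by the first load."""
    return f"CREATE TABLE IF NOT EXISTS {table}"


def copy_into_sql(table: str, source_url: str, file_format: str) -> str:
    """Load raw files as-is into the bronze table, merging in new columns."""
    return (
        f"COPY INTO {table} "
        f"FROM '{source_url}' "
        f"FILEFORMAT = {file_format.upper()} "
        "FORMAT_OPTIONS ('mergeSchema' = 'true') "
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )


def bronze_load_plan(location: str, url: str, credential: str,
                     table: str, file_format: str) -> List[str]:
    """Return the statements in the order they would be executed."""
    return [
        create_external_location_sql(location, url, credential),
        create_bronze_table_sql(table),
        copy_into_sql(table, url, file_format),
    ]


if __name__ == "__main__":
    # Hypothetical names and path, for illustration only.
    plan = bronze_load_plan(
        location="raw_landing",
        url="abfss://raw@example.dfs.core.windows.net/events/",
        credential="raw_cred",
        table="main.bronze.events",
        file_format="json",
    )
    for statement in plan:
        print(statement)  # on Databricks: spark.sql(statement)
```

Keeping the statements as plain strings makes the plan easy to review before anything runs. A DLT pipeline or Auto Loader would be an alternative way to do the same load.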
Monday
Thanks @Rjdudley
I meant to say, the scenario is:
I agree with your reply, myself. But I had received a proposal which suggested a scenario of not "cataloging" Raw, instead, using another tool to achieve the need of searching files in Raw.
I would like to understand if there are benefits in doing so, from the community.
Tuesday
@cdn_yyz_yul wrote:But I had received a proposal which suggested a scenario of not "cataloging" Raw, instead, using another tool to achieve the need of searching files in Raw.
I would like to understand if there are benefits in doing so, from the community.
This is where "it depends" on what your company's setup is, but maybe I can provide some food for thought. Do you already have this other tool, or is it a new purchase? Did this proposal come from a tool vendor or a consultant, or from your CTO? Do you have an ongoing need to search the raw files which would require an additional tool? What business capabilities is this tool going to fulfill--just cataloging and searching, or is it a governance tool like Atlan/Alation/Collibra?
In most use cases you would not need an additional tool to search raw files. You'd either transform the data or create table metadata in Unity Catalog from the files and work with the files directly. The advantage of Unity Catalog is that you have all of the same security settings and data classifications, a familiar UI, and only one thing to administer.
We're in the second year of our Databricks implementation. My approach has been to wait and see whether an actual need arises, and whether Databricks comes out with a feature that solves it. We saw some really nice tools at Data+AI Summit, and shiny new things are easy to get excited about, but I always assess the actual business need and any lack of features before I expand.
If it's not abundantly clear why you need this extra tooling, ask back: "What business capabilities are we realizing, or what features are we lacking?" If the tool's features overlap with features in Databricks, keep in mind that my experience using Databricks has been positive largely because of the integration and simple management. Hope that helps.
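To make the "create table metadata in Unity Catalog from the files" option above more concrete, here is a minimal sketch that builds the SQL for an external table over raw files left in place, plus a grant so access is governed like any other table. The table name, storage path, and `data_engineers` group are hypothetical, and the helpers are illustrative only. On Databricks each string would be passed to `spark.sql(...)`.

```python
from typing import Dict, Optional


def external_table_sql(table: str, file_format: str, location: str,
                       options: Optional[Dict[str, str]] = None) -> str:
    """Define a Unity Catalog external table over files that stay where they are."""
    opts = ""
    if options:
        pairs = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
        opts = f" OPTIONS ({pairs})"
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING {file_format.upper()}{opts} "
        f"LOCATION '{location}'"
    )


def grant_select_sql(table: str, principal: str) -> str:
    """Give a user or group read access to the table."""
    return f"GRANT SELECT ON TABLE {table} TO `{principal}`"


if __name__ == "__main__":
    # Hypothetical names and path, for illustration only.
    table = "main.bronze.orders_raw"
    print(external_table_sql(
        table,
        "csv",
        "abfss://raw@example.dfs.core.windows.net/orders/",
        {"header": "true"},
    ))
    print(grant_select_sql(table, "data_engineers"))
```

The path must fall under an external location you have rights on. Once the table exists, it appears in the same catalog UI and search as everything else.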