Databricks Community

Joezhu · ‎06-28-2023

Why do we need tiers of data? Why can't we just have all the data go to one tier and just work off of that?

hdhax · ‎06-28-2023

Here are a few reasons data tiers are needed.

1. Performance Optimization: Different tiers of data allow for optimized performance based on the specific needs of each tier. For example, high-priority or frequently accessed data can be stored in a high-performance tier with faster access times and processing capabilities. This ensures that critical data is readily available and can be processed quickly, resulting in improved operational efficiency.

2. Resource Allocation: Data tiers enable organizations to allocate resources such as storage, computing power, and bandwidth more efficiently. Not all data requires the same level of resources. By segregating data into different tiers, organizations can match resource allocation to the specific needs of each tier.

3. Data Retention Policies: Different types of data may have varying retention requirements based on legal, compliance, or business needs. Tiers of data facilitate the implementation of data retention policies.
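
As a rough illustration of the retention point, a per-tier policy could be expressed along the lines of the sketch below. The tier names, the retention periods, and the `is_expired` helper are all hypothetical and chosen only for the example; none of them come from a Databricks API.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per tier (illustrative values only).
RETENTION = {
    "bronze": timedelta(days=90),
    "silver": timedelta(days=365),
    "gold": timedelta(days=730),
}


def is_expired(tier: str, created_at: datetime, now: datetime) -> bool:
    """Return True when a record is older than its tier's retention period."""
    return now - created_at > RETENTION[tier]
```

With separate tiers, a cleanup job only has to look up the policy for the tier it is working on rather than classify every record individually.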

01_binary · ‎06-28-2023

According to the Databricks documentation, the goal is to incrementally and progressively improve the structure and quality of data as it flows through each layer of the architecture.

Most of the time, raw data is not useful as-is and needs to be cleaned or supplemented with other datasets.

We could store everything in one layer, but it's easier to understand and manage if the layers are kept separate. The separation can be logical or physical; it's really your choice. You will find use cases where someone needs to access bronze data for a gold use case, though that data may have quality issues.

Vishwas · ‎06-28-2023

In addition to the reasons already mentioned (resource allocation, performance optimization, and retention), data curation is also worth considering here.

The bronze layer usually stays very close to the source, which enables replayability and provides a point for debugging when upstream systems aren't accessible. The silver layer is where deduplication and curation happen according to enterprise needs; the base copy remains available in bronze if it is needed.

The gold layer enables data blending, lookups, and enrichment of datasets for various use cases.
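
To make the three layers concrete, here is a minimal sketch in plain Python rather than Spark. The sample records, the `customers` lookup table, and the `to_silver` / `to_gold` helpers are all made up for illustration; a real pipeline would typically use tables and DataFrame operations instead.

```python
# Bronze: raw records kept exactly as they arrived, duplicates included.
bronze = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},  # duplicate delivery
    {"order_id": 2, "customer_id": "c2", "amount": 25.0},
]

# Hypothetical lookup table used for enrichment in the gold step.
customers = {"c1": "Alice", "c2": "Bob"}


def to_silver(rows):
    """Deduplicate on order_id, keeping the first occurrence of each."""
    seen = set()
    silver = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            silver.append(dict(row))  # copy so bronze stays untouched
    return silver


def to_gold(rows, lookup):
    """Enrich each silver row with the customer name from the lookup table."""
    return [{**row, "customer_name": lookup.get(row["customer_id"])} for row in rows]


silver = to_silver(bronze)
gold = to_gold(silver, customers)
```

Because `bronze` is never modified, the silver and gold steps can be rerun from it at any time, which is the replay property described above.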