Data Engineering

DLT bronze tables

Faisal
Contributor

I am trying to ingest incremental Parquet file data into a bronze streaming table. How much historical data should ideally be retained in the bronze layer as a general best practice, given that I will only use bronze to ingest source data and move it to silver streaming tables using APPLY_CHANGES_INTO?

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Faisal,

As a general best practice, retain as much historical data in the Bronze layer as is necessary to ensure data quality and accuracy.

One way to decide on the retention period could be to consider the following factors:

  1. Reconciliation and Auditing: Retain enough data to support any reconciliation, auditing, or compliance checks that may be necessary. This will depend on the regulatory or business requirements of your organization.

  2. Data Latency: Retain enough data to maintain a sufficient window of time for your data pipelines to capture and process batch and streaming updates. This will depend on the overall latency requirements and your data pipeline architecture.

  3. Data Size and Cost: Retain enough data so downstream consumers like Silver tables don't miss relevant updates. However, the amount of data should be reasonable enough to avoid incurring unnecessary storage costs.

Based on these factors, retain as much historical data as your business requirements call for. In general, aim to keep at least a few days or weeks of data so there is a sufficient window to capture incremental updates, depending on the ingestion frequency and the rate of data accumulation.
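To make the storage side of that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the numbers in the usage lines (daily ingest volume, price per GB) are hypothetical placeholders, not figures from this thread or from any pricing page; substitute your own measurements.

```python
def retained_volume_gb(daily_ingest_gb: float, retention_days: int) -> float:
    """Approximate bronze volume kept at steady state for a retention window."""
    if daily_ingest_gb < 0 or retention_days < 0:
        raise ValueError("inputs must be non-negative")
    return daily_ingest_gb * retention_days


def monthly_storage_cost(volume_gb: float, price_per_gb_month: float) -> float:
    """Approximate monthly storage cost for the retained volume."""
    return volume_gb * price_per_gb_month


# Hypothetical comparison of a 7-day, 30-day, and 90-day window.
for days in (7, 30, 90):
    volume = retained_volume_gb(daily_ingest_gb=50, retention_days=days)
    cost = monthly_storage_cost(volume, price_per_gb_month=0.02)
    print(f"{days} days: ~{volume:.0f} GB retained, ~{cost:.2f} per month")
```

This ignores compression, file versioning, and compute cost, so treat it only as a first estimate when comparing candidate retention windows.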

However, also be mindful of the impact that too much historical data can have on query performance and processing times for your data applications. In any case, ingest the historical data only once; after that, capture only the incremental changes. You can revise the retention policy over time as your business, regulatory, or performance requirements change.

MuthuLakshmi
New Contributor III

The amount of historical data that should be retained in the bronze layer depends on your specific use case and requirements. As a general best practice, retain enough historical data to support your downstream analytics and machine learning workloads, while also considering the cost and performance implications of storing and processing large amounts of data.

One approach to managing historical data in the bronze layer is to use partitioning and time-based data retention policies. For example, you can partition your data by date or time, and then use a retention policy to automatically delete or archive old partitions after a certain period of time. This can help you manage the size of your data lake and reduce storage costs, while still retaining enough historical data to support your use cases.
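As a minimal sketch of the date-partition approach described above, the Python below works out which partitions fall outside a retention window and builds a matching SQL DELETE statement. The table name `bronze.events`, the partition column `ingest_date`, and the 30-day window are all assumptions made for illustration. Whether you can run such a statement directly against your bronze table, and how downstream streaming readers react to deleted rows, depends on your pipeline setup, so check that before relying on it.

```python
from datetime import date, timedelta


def retention_cutoff(today: date, retention_days: int) -> date:
    """Return the oldest partition date that should still be kept."""
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    return today - timedelta(days=retention_days)


def expired_partitions(partition_dates, today: date, retention_days: int):
    """Return partition dates strictly older than the cutoff, sorted ascending."""
    cutoff = retention_cutoff(today, retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


def build_delete_statement(table: str, partition_column: str, cutoff: date) -> str:
    """Build a SQL DELETE that removes rows in partitions older than the cutoff."""
    return (
        f"DELETE FROM {table} "
        f"WHERE {partition_column} < DATE '{cutoff.isoformat()}'"
    )


# Hypothetical usage: table, column, and dates are placeholders.
partitions = [date(2024, 1, 1), date(2024, 1, 20), date(2024, 2, 1)]
today = date(2024, 2, 10)

old = expired_partitions(partitions, today, retention_days=30)
statement = build_delete_statement(
    "bronze.events", "ingest_date", retention_cutoff(today, 30)
)
print(old)
print(statement)
```

On Delta tables a DELETE only marks data as removed; the underlying files are physically dropped later, when VACUUM runs after the table's file retention period has passed. If you archive instead of deleting, the same cutoff can be used to select the partitions to copy out first.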
