
Implementing Full Load Strategy with Delta Live Tables and Unity Catalog

yvishal519
Contributor

Hello Databricks Community,

I am seeking guidance on handling full load scenarios with Delta Live Tables (DLT) and Unity Catalog. Here's the situation I'm dealing with:

We have a data folder in Azure Data Lake Storage (ADLS) where we use Auto Loader to ingest the latest data into a Delta Live Table. The data is loaded as a full load from the source system on a daily basis. However, there are two key challenges:

  1. Record Deletion: Records may be deleted in the source system, but these deletions are not reflected in our DLT tables because Auto Loader only ingests newly arrived files and has no way to signal records removed at the source.

  2. Overwriting Data: Since the source system performs full loads daily and there are no primary columns or unique identifiers in the source data, we need to overwrite the existing data in the DLT table to accurately reflect the current state of the source.

Given that we are using Unity Catalog with our DLT tables, I would like to understand the best practices for implementing a full load strategy that allows us to overwrite the entire dataset in our Delta tables. Specifically, I am looking for guidance on:

  • How to effectively overwrite data in a Delta Live Table when the source system performs full loads.
  • Strategies to ensure that deleted records from the source system are also removed from the Delta table.

Any insights or examples on how to achieve this would be greatly appreciated.

Thank you!

ACCEPTED SOLUTION

yvishal519
Contributor

To efficiently manage full data loads, we can leverage a regex pattern to dynamically identify the latest data folders within our bronze layer. These folders typically contain the most recent data updates for our tables. By using a Python script, we can pinpoint the correct folder path and integrate it with our Auto Loader function. This ensures that each pipeline run processes only the freshest data.
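
As a rough sketch of that idea (assuming the DLT pipeline source runs where spark and dbutils are predefined, that the bronze sub-folders are date-stamped like 2024-07-15, and that the files are Parquet; the path and table name below are illustrative, not taken from this thread):

```python
import re
import dlt

# Illustrative values only; the real container, folder naming convention,
# file format, and table name are not given in the original post.
BASE_PATH = "abfss://bronze@<storage_account>.dfs.core.windows.net/customers"
DATE_FOLDER = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # e.g. 2024-07-15

def latest_full_load_path(base_path: str) -> str:
    """Pick the sub-folder whose date-stamped name sorts last (ISO dates sort lexicographically)."""
    # Assumes dbutils is available to the pipeline environment.
    names = [f.name.rstrip("/") for f in dbutils.fs.ls(base_path)]
    dated = sorted(n for n in names if DATE_FOLDER.match(n))
    if not dated:
        raise ValueError(f"No date-stamped folders found under {base_path}")
    return f"{base_path}/{dated[-1]}/"

@dlt.table(name="bronze_customers", comment="Latest full load from the source system")
def bronze_customers():
    # Auto Loader is pointed only at the newest folder, so a full-refresh run
    # rebuilds the table from the most recent extract instead of appending history.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(latest_full_load_path(BASE_PATH))
    )
```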

With this setup, Delta Live Tables will conduct a full refresh of the tables every time the pipeline executes, writing the latest data to the DLT tables. To streamline this process, we will configure Databricks workflows to handle full refresh operations. These workflows will automate and schedule the full data loads, ensuring consistency and reliability. I've rigorously tested this approach, and it has proven effective across various scenarios.
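
For the scheduled full-refresh part, one option is a job (or any external scheduler) that calls the Delta Live Tables API to start an update with full_refresh enabled; a pipeline task in a Databricks workflow with the full refresh option achieves the same thing. The host, token handling, and pipeline ID below are placeholders:

```python
import requests

# Placeholders; none of these values come from the thread.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
PIPELINE_ID = "<dlt-pipeline-id>"

def start_full_refresh(pipeline_id: str) -> str:
    """Start a DLT pipeline update with full_refresh=True so tables are recomputed, not appended to."""
    resp = requests.post(
        f"{HOST}/api/2.0/pipelines/{pipeline_id}/updates",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"full_refresh": True},
    )
    resp.raise_for_status()
    return resp.json()["update_id"]

if __name__ == "__main__":
    print("Started full-refresh update:", start_full_refresh(PIPELINE_ID))
```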


5 REPLIES

Kaniz_Fatma
Community Manager

Hi @yvishal519

  1. To overwrite the entire dataset in your Delta Live Table, you can use the overwrite mode in your write operations. This ensures that the existing data in the Delta table is replaced with the new data from the source.
  2. To ensure that deleted records from the source system are also removed from the Delta table, you can implement a merge operation. The MERGE statement synchronizes the Delta table with the source data, so any records deleted in the source system are also removed from the Delta table (a minimal sketch covering points 1, 2, and 4 follows this list).
  3. When using Unity Catalog, make sure your Delta Live Tables are registered in the catalog. This helps in managing permissions and ensuring data governance.
  4. Enable schema evolution to handle changes in the schema of your source data. You can do this by setting .option("mergeSchema", "true").
  5. Use partitioning and Z-ordering to optimize read and write performance.
  6. Finally, use Databricks' built-in monitoring tools to keep track of your DLT pipelines and ensure they are running smoothly.
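
To illustrate points 1, 2, and 4 together, here is a minimal sketch written for a plain Delta table maintained outside a DLT-managed flow (DLT decides how its own tables are materialized, so an explicit overwrite like this applies to tables you write yourself). The catalog, schema, table, landing path, and the customer_id key are all hypothetical; the original question notes the source has no unique identifier, which is why the plain overwrite is usually the practical option there.

```python
from pyspark.sql import SparkSession, DataFrame
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical names; replace with your own catalog.schema.table and landing path.
TARGET = "main.sales.customers"
SOURCE_PATH = "abfss://landing@<storage_account>.dfs.core.windows.net/customers/latest/"

def overwrite_with_full_load(df: DataFrame) -> None:
    """Point 1: replace the whole table with the latest full extract (no key needed)."""
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("mergeSchema", "true")   # point 4: tolerate newly added source columns
       .saveAsTable(TARGET))

def merge_full_load(df: DataFrame) -> None:
    """Point 2, only usable when a reliable business key exists: upsert and delete
    rows no longer present in the source (requires Delta Lake 2.3+ / DBR 12.2+)."""
    target = DeltaTable.forName(spark, TARGET)
    (target.alias("t")
        .merge(df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .whenNotMatchedBySourceDelete()
        .execute())

if __name__ == "__main__":
    full_load_df = spark.read.parquet(SOURCE_PATH)
    overwrite_with_full_load(full_load_df)   # the simpler option when there is no key
```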

Kaniz_Fatma
Community Manager

Hi @yvishal519, Thank you for reaching out to our community! We're here to help you.

To ensure we provide you with the best support, could you please take a moment to review the responses and choose the one that best answers your question? Your feedback not only helps us assist you better but also benefits other community members who may have similar questions in the future.

If you found the answer helpful, consider giving it a kudo. If the response fully addresses your question, please mark it as the accepted solution. This will help us close the thread and ensure your question is resolved.

We appreciate your participation and are here to assist you further if you need it!

yvishal519
Contributor

Hi @Kaniz_Fatma,

Could you please provide some sample code? It would be a great help in implementing a full load with Unity Catalog in DLT, as I'm having difficulty doing so using the provided link.

yvishal519
Contributor

@Kaniz_Fatma I tried the above methods, but it is still appending records rather than overwriting them as a full load. Is there any other way to overwrite the existing DLT records with the latest data?

