Data Engineering

Looking for ways to speed up DLT testing

eballinger
Visitor

Hi Guys,

I am new to this community. We have what I am guessing is a typical setup (DLT tables, 3 layers: bronze, silver, and gold), and while it works fine in our development environment, I have always looked for ways to speed things up for testers.

For example, right now it takes 2-3 hours to run our Delta pipeline (~80 tables) in development. If I want to update just one record in one table, I have to run the entire pipeline again, and it takes around the same amount of time. That's a lot of time waiting for one test case.

Since our landing-to-raw DLT code is dynamically driven from a table list, I was hoping we could process just one table when we want to update and test only that table. However, I discovered that once you undeclare a DLT table, it gets removed.
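For reference, the dynamic declaration looks roughly like this (a minimal sketch; the table list, paths, and format are placeholders for our real metadata-driven setup):

import dlt

# Placeholder for the list we actually read from a metadata table.
TABLES = ["customers", "orders", "payments"]

def declare_raw_table(name: str):
    # Each entry in the list becomes one landing-to-raw DLT table.
    @dlt.table(name=f"raw_{name}", comment=f"Landing-to-raw ingest for {name}")
    def raw():
        return spark.read.format("json").load(f"/landing/{name}/")
    return raw

for t in TABLES:
    declare_raw_table(t)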

So my question is: is there any way to tell DLT never to remove a table, even when it's no longer referenced in the pipeline? That way I could update individual tables in development (ignoring the others when I'm not testing them) and save a lot of time.

Any other tips for speeding up testing of a small number of tables/records would also be appreciated.

Thanks
Ed


1 REPLY

Walter_C
Databricks Employee

There isn't a direct way to achieve this within the current DLT framework. When a DLT table is undeclared, it is designed to be removed from the pipeline, which includes the underlying data.

However, there are a few strategies you can consider to speed up your testing process and manage your tables more effectively:

  1. Selective Table Processing: Instead of running the entire pipeline, create a separate, smaller pipeline specifically for testing that includes only the tables you need. That way you avoid the overhead of processing all ~80 tables and focus only on the ones relevant to your current test case (see the first sketch after this list).

  2. Incremental Updates: If your testing involves updating a small number of records, use incremental processing so each run handles only the changes since the last update instead of recomputing everything; this can significantly reduce processing time (see the streaming-table sketch after this list).

  3. Snapshot Isolation: Use Delta time travel to pin a stable view of your data at a specific version or timestamp. You can snapshot the table you want to test, make your updates, and then compare the results against the snapshot without affecting the rest of the dataset (see the time-travel sketch after this list).

  4. Parallel Processing: If your development environment supports it, consider running multiple instances of your pipeline in parallel. This can help distribute the load and reduce the overall processing time.

  5. Caching Intermediate Results: Cache intermediate results of your pipeline stages. This can help avoid reprocessing the same data multiple times and speed up the overall pipeline execution.

  6. Optimizing Pipeline Configuration: Review and optimize your pipeline configuration. Ensure that you are using the appropriate cluster size and configuration for your workload. Sometimes, increasing the cluster size or using a more powerful instance type can significantly reduce processing time.
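For idea 1, one pattern is to drive the metadata loop from a pipeline configuration value, so a small test pipeline declares only the tables under test. A minimal sketch, assuming a config key named test.tables (an arbitrary name of your own, not a built-in DLT setting) and an illustrative table list:

import dlt

# The full list would normally come from your metadata table.
ALL_TABLES = ["customers", "orders", "payments"]

# Optional comma-separated override set in the test pipeline's configuration.
# "test.tables" is an arbitrary key we define here, not a built-in DLT option.
selected = spark.conf.get("test.tables", ",".join(ALL_TABLES)).split(",")

def declare_raw_table(name: str):
    @dlt.table(name=f"raw_{name}")
    def raw():
        return spark.read.format("json").load(f"/landing/{name}/")
    return raw

for t in selected:
    declare_raw_table(t.strip())

The full pipeline runs unchanged when the key is absent; setting test.tables to a single name in the test pipeline's settings builds just that table.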
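For idea 2, declaring the landing-to-raw tables as streaming tables with Auto Loader means each pipeline update processes only data that arrived since the last run. A sketch with an illustrative path and format:

import dlt

@dlt.table(name="raw_orders")
def raw_orders():
    # Auto Loader tracks which files it has already ingested, so a rerun
    # picks up only new files instead of reprocessing the whole source.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/orders/")
    )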
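For idea 3, Delta time travel gives you that stable point-in-time view to diff against. A sketch of the comparison step, with a placeholder table name and version number:

# Pin the pre-change state of the table at a known Delta version.
before = spark.read.option("versionAsOf", 42).table("dev.silver.orders")

# After running the pipeline with the test change, read the current state.
after = spark.read.table("dev.silver.orders")

after.exceptAll(before).show()   # rows added or changed by the test run
before.exceptAll(after).show()   # rows removed or changed by the test run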
