Data Engineering

Looking for ways to speed up DLT testing

eballinger
Visitor

Hi Guys,

I am new to this community. We have what I am guessing is a typical setup (DLT tables, 3 layers: bronze, silver, and gold), and while it works fine in our development environment, I have always looked for ways to speed things up for testers.

For example, right now it takes 2-3 hours to run our Delta pipeline (~80 tables) in development. If I want to update just one record in one table, I have to run the entire pipeline again, and it takes around the same amount of time. That's a lot of time waiting for one test case.

Since our landing-to-raw DLT code is dynamically driven from a table list, I was hoping we could process just one table when we want to update and test only that table. However, I discovered that once you undeclare a DLT table, it gets removed.
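For reference, the dynamic declaration looks roughly like this (a minimal sketch; the table list, paths, and format are placeholders for our real metadata-driven setup):

import dlt

# Placeholder for the list we actually read from a metadata table.
TABLES = ["customers", "orders", "payments"]

def declare_raw_table(name: str):
    # Each entry in the list becomes one landing-to-raw DLT table.
    @dlt.table(name=f"raw_{name}", comment=f"Landing-to-raw ingest for {name}")
    def raw():
        return spark.read.format("json").load(f"/landing/{name}/")
    return raw

for t in TABLES:
    declare_raw_table(t)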

So my question is: is there any way to tell DLT never to remove a table, even when it's no longer referenced in the pipeline? That way I could update individual tables in development (ignoring the others when I'm not testing them) and save a lot of time.

Any other tips for speeding up testing of a small number of tables/records would also be appreciated.

Thanks
Ed


1 REPLY

Walter_C
Databricks Employee

There isn't a direct way to achieve this within the current DLT framework. When a DLT table is undeclared, it is designed to be removed from the pipeline, which includes the underlying data.

However, there are a few strategies you can consider to speed up your testing process and manage your tables more effectively:

  1. Selective Table Processing: Instead of running the entire pipeline, create a separate, smaller pipeline specifically for testing that includes only the tables you need. That way you avoid the overhead of processing all ~80 tables and focus only on the ones relevant to your current test case (see the first sketch after this list).

  2. Incremental Updates: If your testing involves updating a small number of records, use incremental processing so each run handles only the changes since the last update instead of recomputing everything; this can significantly reduce processing time (see the streaming-table sketch after this list).

  3. Snapshot Isolation: Use Delta time travel to pin a stable view of your data at a specific version or timestamp. You can snapshot the table you want to test, make your updates, and then compare the results against the snapshot without affecting the rest of the dataset (see the time-travel sketch after this list).

  4. Parallel Processing: If your development environment supports it, consider running multiple instances of your pipeline in parallel. This can help distribute the load and reduce the overall processing time.

  5. Caching Intermediate Results: Cache intermediate results of your pipeline stages. This can help avoid reprocessing the same data multiple times and speed up the overall pipeline execution.

  6. Optimizing Pipeline Configuration: Review and optimize your pipeline configuration. Ensure that you are using the appropriate cluster size and configuration for your workload. Sometimes, increasing the cluster size or using a more powerful instance type can significantly reduce processing time.
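For idea 1, one pattern is to drive the metadata loop from a pipeline configuration value, so a small test pipeline declares only the tables under test. A minimal sketch, assuming a config key named test.tables (an arbitrary name of your own, not a built-in DLT setting) and an illustrative table list:

import dlt

# The full list would normally come from your metadata table.
ALL_TABLES = ["customers", "orders", "payments"]

# Optional comma-separated override set in the test pipeline's configuration.
# "test.tables" is an arbitrary key we define here, not a built-in DLT option.
selected = spark.conf.get("test.tables", ",".join(ALL_TABLES)).split(",")

def declare_raw_table(name: str):
    @dlt.table(name=f"raw_{name}")
    def raw():
        return spark.read.format("json").load(f"/landing/{name}/")
    return raw

for t in selected:
    declare_raw_table(t.strip())

The full pipeline runs unchanged when the key is absent; setting test.tables to a single name in the test pipeline's settings builds just that table.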
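For idea 2, declaring the landing-to-raw tables as streaming tables with Auto Loader means each pipeline update processes only data that arrived since the last run. A sketch with an illustrative path and format:

import dlt

@dlt.table(name="raw_orders")
def raw_orders():
    # Auto Loader tracks which files it has already ingested, so a rerun
    # picks up only new files instead of reprocessing the whole source.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/orders/")
    )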
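For idea 3, Delta time travel gives you that stable point-in-time view to diff against. A sketch of the comparison step, with a placeholder table name and version number:

# Pin the pre-change state of the table at a known Delta version.
before = spark.read.option("versionAsOf", 42).table("dev.silver.orders")

# After running the pipeline with the test change, read the current state.
after = spark.read.table("dev.silver.orders")

after.exceptAll(before).show()   # rows added or changed by the test run
before.exceptAll(after).show()   # rows removed or changed by the test run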
