Data Engineering

Any way to ignore DLT tables in pipeline

eballinger
New Contributor III

Hello,

In our testing environment, we would like to update only the DLT tables we are testing in our pipeline, which would speed up testing. Our pipeline code is currently generated dynamically based on how many tables need to be processed.

What I have discovered, though, is that when I run the pipeline without referencing a table, that DLT table gets removed/deleted. Is there any way to declare a DLT table so that it cannot be removed?

Just so I'm clear, here is an example:

Pretend there are 10 DLT tables in total in our pipeline and it takes 10 minutes to run.

If I only want to update 2 tables in testing I would like to run the pipeline for those 2 tables only and just ignore the others since they are not being updated. This would then take about 2 minutes.   

I know the pipeline GUI has a "Select tables for refresh" button that does exactly what I want. The only difference is that I want to do this from the Python code instead, since that is where I dynamically declare the DLT tables.

1 ACCEPTED SOLUTION

Alberto_Umana
Databricks Employee

Hi @eballinger.

To address your requirement of updating only specific Delta Live Tables (DLT) in your testing environment without removing the others, you can leverage the @dlt.table decorator and the temporary parameter in your Python code. This approach allows you to create temporary tables that persist only for the lifetime of the pipeline run, thus preventing their removal when not referenced in subsequent runs.

 

Here’s how you can modify your pipeline to achieve this:

  1. Define Temporary Tables: Use the temporary=True parameter in the @dlt.table decorator to create tables that are not removed when not referenced in the pipeline run.
  2. Selective Table Updates: Dynamically generate the pipeline code to include only the tables you want to update. The temporary tables will persist for the duration of the pipeline run and will not be deleted if not referenced in subsequent runs.

Here’s an example of how you can define a temporary table:

import dlt

@dlt.table(temporary=True)
def my_temp_table():
    return spark.read.table("source_table")

In your dynamic pipeline generation logic, you can conditionally include or exclude tables based on your testing requirements. This way, you can run the pipeline for only the tables you need to update, and the temporary tables will not be removed if they are not included in the run.
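
A minimal sketch of that conditional-declaration logic, assuming the list of tables to test is passed in through a pipeline configuration value. The key name `test.tables`, the `ALL_TABLES` list, and the `source_<name>` naming are hypothetical and not from the original post; `dlt` and `spark` are passed in as arguments only so the helpers can be exercised outside a pipeline. Note that, as the original question reports, a table that is no longer declared in the pipeline code may be removed on the next update, so this illustrates the mechanism rather than a guaranteed way to keep the skipped tables.

```python
# Hypothetical full list of tables the pipeline normally declares.
ALL_TABLES = ["customers", "orders", "payments"]


def parse_selection(raw, all_tables):
    """Return the tables to declare; a blank selection means all of them."""
    wanted = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = sorted(set(wanted) - set(all_tables))
    if unknown:
        raise ValueError(f"Unknown tables requested: {unknown}")
    if not wanted:
        return list(all_tables)
    # Keep the pipeline's own ordering rather than the caller's.
    return [name for name in all_tables if name in wanted]


def declare_table(dlt, spark, name):
    """Register one DLT table; a factory avoids late-binding bugs in loops."""
    @dlt.table(name=name)
    def load():
        # Assumed naming convention for the upstream source table.
        return spark.read.table(f"source_{name}")
    return load


def declare_tables(dlt, spark, names):
    """Declare every table in `names` against the given dlt module."""
    for name in names:
        declare_table(dlt, spark, name)

# Inside the pipeline notebook, usage might look like:
#   import dlt
#   raw = spark.conf.get("test.tables", "")   # hypothetical config key
#   declare_tables(dlt, spark, parse_selection(raw, ALL_TABLES))
```

Reading the selection from configuration keeps the generated code identical between test and production runs; only the configuration value changes.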

Additionally, you can reference other tables in the same pipeline with spark.read.table("LIVE.table_name"), so that dependencies between tables resolve correctly during pipeline execution.

2 REPLIES

Hi again Alberto,

I just tested your solution, and I think I missed what you were saying about the table only persisting while the pipeline is running. That might work for some other scenarios, but in my example above I want all 10 of my DLT tables to exist after the pipeline has run. So how can I update just 2 tables and ignore the other 8? Since there is a way to accomplish this through the GUI, shouldn't there also be some way to accomplish it programmatically?

Thanks again for your help. 

Eddie
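
One route worth checking, though it is not confirmed anywhere in this thread, is the Pipelines REST API: starting an update with a `refresh_selection` list is intended to limit the update to the named tables, much like the "Select tables for refresh" button. The sketch below builds such a request with the Python standard library. The host, pipeline ID, token, and table names are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_update_request(host, pipeline_id, tables, token):
    """Build a POST request that starts an update for selected tables only."""
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    # refresh_selection lists the tables to update; others are left untouched.
    body = json.dumps({"refresh_selection": list(tables)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def start_selective_update(host, pipeline_id, tables, token):
    """Send the request and return the parsed JSON response."""
    request = build_update_request(host, pipeline_id, tables, token)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Example call (placeholder values, not run here):
#   start_selective_update(
#       "https://<workspace-host>", "<pipeline-id>",
#       ["table_1", "table_2"], "<personal-access-token>")
```

Because the selection lives in the update request rather than in the pipeline code, all 10 tables stay declared and only the listed ones are refreshed.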
