
Clearing data stored by pipelines

gabrieleladd
New Contributor II

Hi everyone! I'm new to Databricks and taking my first steps with Delta Live Tables, so please forgive my inexperience. I'm building my first DLT pipeline and there's something I can't quite grasp: how do I clear all the objects generated or updated by a pipeline run (tables and metadata)? I'll probably need to make changes and additions over time as my understanding of the subject progresses, and I'd like to be able to rerun the pipeline from scratch and reprocess all the data (I'm simulating the data stream and triggering the data inflow myself).

I understand (correct me if I'm wrong) that streaming live tables avoid reprocessing files from the cloud_files() source by keeping track of the files they have already processed. A simple DROP would probably take care of the data tables, but I can't see how to get a completely clean slate given all the extra state that gets stored when the pipeline runs.
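
For reference, my streaming tables look roughly like the sketch below (the source path, format, and table name are just placeholders, not my real setup):

```python
import dlt

# Placeholder landing directory for the simulated data stream
SOURCE_PATH = "/mnt/landing/events/"

@dlt.table(comment="Raw events ingested incrementally with Auto Loader")
def raw_events():
    # Auto Loader (cloud_files() in SQL) remembers which files it has already
    # ingested -- this is exactly the state I'd like to be able to reset.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(SOURCE_PATH)
    )
```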

Thanks for your help 🙂

3 REPLIES

Lakshay
Databricks Employee

If you want to reprocess all the data, you can simply use the "Full Refresh" option in the DLT pipeline.

You can read more about it here: https://docs.databricks.com/en/delta-live-tables/updates.html#how-delta-live-tables-updates-tables-a...
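
If you prefer to trigger it programmatically rather than from the UI, you can also start a full-refresh update through the Pipelines REST API, along these lines (the workspace URL, token, and pipeline ID below are placeholders):

```python
import requests

# Placeholders -- substitute your workspace URL, a personal access token,
# and the ID of your DLT pipeline.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
PIPELINE_ID = "<pipeline-id>"

# Start a pipeline update with full_refresh=True so the tables are
# truncated and all source data is reprocessed from scratch.
resp = requests.post(
    f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"full_refresh": True},
)
resp.raise_for_status()
print(resp.json())  # includes the update_id of the triggered run
```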

Thank you for your answer @Lakshay  🙂 

I am aware of the "full refresh" option; I wasn't considering it because I didn't think it could solve all my issues at once. As I understand it, it works for updates and additions (e.g. changes in columns), since it overwrites all the tables and metadata already in place and reprocesses all the files in the cloud_files() directory.

My doubt, on the other hand, is that this wouldn't cover my potential need to completely delete some tables along with their related metadata, unless "full refresh" means that when I remove a table's definition from my pipeline code, the table and its metadata are also removed from the target directory and schema.

ChKing
New Contributor II

To clear all objects generated or updated by the DLT pipeline, you can drop the tables manually using the DROP command as you've mentioned. However, to get a completely clean slate, including metadata like the tracking of already processed files in the cloud_files() directory, you'll need to consider both the data tables and the metadata stored in checkpoint directories.
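
As a rough sketch of what that manual cleanup could look like (the table names and storage path below are placeholders; adjust them to your pipeline's target schema and settings):

```python
# Drop the data tables produced by the pipeline -- placeholder names.
tables_to_drop = [
    "my_catalog.my_schema.raw_events",
    "my_catalog.my_schema.events_clean",
]
for t in tables_to_drop:
    spark.sql(f"DROP TABLE IF EXISTS {t}")

# DLT keeps its bookkeeping (checkpoints, Auto Loader file-tracking state,
# event logs) under the pipeline's storage location. Removing it forces the
# next run to start from a clean slate. Only do this while the pipeline is
# stopped and you genuinely want to lose all of that state.
pipeline_storage = "dbfs:/pipelines/<pipeline-id>"  # placeholder path
dbutils.fs.rm(pipeline_storage, recurse=True)
```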
