Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Clearing data stored by pipelines

gabrieleladd
New Contributor II

Hi everyone! I'm new to Databricks and taking my first steps with Delta Live Tables, so please forgive my inexperience. I'm building my first DLT pipeline and there's something I can't quite grasp: how to clear all the objects generated or updated by a pipeline run (tables and metadata). I'll probably need to make changes and additions over time as my understanding of the subject progresses, and I'd like to be able to rerun the pipeline from scratch and reprocess all the data (I'm simulating the data stream, so I trigger the data inflow myself).

I understand (correct me if I'm wrong) that streaming live tables avoid reprocessing files in the cloud_files() source directory by storing metadata about the files that have already been processed. While I believe a simple DROP would do for the data tables, I can't see how to get a completely clean slate given all the extra state that gets stored when the pipeline is run.

Thanks for your help 🙂

2 REPLIES

Lakshay
Esteemed Contributor

If you want to reprocess all the data, you can simply use the "Full Refresh" option in the DLT pipeline.

You can read more about it here: https://docs.databricks.com/en/delta-live-tables/updates.html#how-delta-live-tables-updates-tables-a...
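As a side note, a full refresh can also be triggered programmatically through the Pipelines REST API rather than the UI. Below is a minimal sketch that only builds the request; the workspace URL and pipeline ID are placeholders you would substitute, and the actual call (e.g. with the requests library plus a bearer token) is left as a comment:

```python
import json

# Hypothetical placeholders -- substitute your own workspace URL and pipeline ID.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
PIPELINE_ID = "<pipeline-id>"

# The Pipelines API starts an update via
# POST /api/2.0/pipelines/{pipeline_id}/updates;
# setting full_refresh to true truncates the tables and reprocesses all inputs.
endpoint = f"{WORKSPACE_URL}/api/2.0/pipelines/{PIPELINE_ID}/updates"
payload = json.dumps({"full_refresh": True})

print(endpoint)
print(payload)
# e.g. requests.post(endpoint, data=payload,
#                    headers={"Authorization": f"Bearer {token}"})
```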

Thank you for your answer @Lakshay 🙂

I am aware of the "Full Refresh" option, but I wasn't considering it because I didn't think it could solve all my issues at once. My understanding is that it handles updates and additions (e.g. changes in columns), since it overwrites all the tables and metadata already in place and reprocesses all the files in the cloud_files() directory.

On the other hand, my concern is that this option can't cover my potential need to completely delete some of the tables along with their metadata, unless "full refresh" means that when I remove a table's definition from my pipeline code, the table and its related metadata are also removed from the target directory and schema.
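For the complete-deletion case, one approach is to drop the no-longer-defined tables explicitly. The sketch below only generates the DROP statements; the schema and table names are hypothetical placeholders, and inside a Databricks notebook each statement would be executed with spark.sql (commented out here so the snippet stands alone):

```python
# Hypothetical target schema and table names -- substitute your own.
target_schema = "my_dlt_target"
tables_to_remove = ["bronze_events", "silver_events"]

# Build one DROP statement per obsolete table.
# IF EXISTS makes the cleanup safe to rerun.
drop_statements = [
    f"DROP TABLE IF EXISTS {target_schema}.{name}" for name in tables_to_remove
]

for stmt in drop_statements:
    print(stmt)
    # spark.sql(stmt)  # uncomment inside a Databricks notebook
```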
