Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Clearing data stored by pipelines

gabrieleladd
New Contributor II

Hi everyone! I'm new to Databricks and taking my first steps with Delta Live Tables, so please forgive my inexperience. I'm building my first DLT pipeline and there's something I can't quite grasp: how to clear all the objects generated or updated by a pipeline run (tables and metadata). I'll probably need to make changes and additions over time as my understanding of the subject progresses, and I'd like to be able to rerun the pipeline from scratch and reprocess all the data (I'm simulating the data stream, so I trigger the data inflow myself).

I understand (correct me if I'm wrong) that streaming live tables avoid reprocessing files in the cloud_files() source directory by storing metadata about the files that have already been processed. While I believe a simple DROP would do for the data tables, I can't see how to get a completely clean slate given all the extra state that gets stored when the pipeline is run.

Thanks for your help 🙂

2 REPLIES

Lakshay
Esteemed Contributor

If you want to reprocess all the data, you can simply use the "Full Refresh" option in the DLT pipeline.

You can read more about it here: https://docs.databricks.com/en/delta-live-tables/updates.html#how-delta-live-tables-updates-tables-a...
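As a side note, a full refresh can also be triggered programmatically through the Pipelines REST API rather than the UI. Below is a minimal sketch that only builds the request; the workspace URL and pipeline ID are placeholders you would substitute, and the actual call (e.g. with the requests library plus a bearer token) is left as a comment:

```python
import json

# Hypothetical placeholders -- substitute your own workspace URL and pipeline ID.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
PIPELINE_ID = "<pipeline-id>"

# The Pipelines API starts an update via
# POST /api/2.0/pipelines/{pipeline_id}/updates;
# setting full_refresh to true truncates the tables and reprocesses all inputs.
endpoint = f"{WORKSPACE_URL}/api/2.0/pipelines/{PIPELINE_ID}/updates"
payload = json.dumps({"full_refresh": True})

print(endpoint)
print(payload)
# e.g. requests.post(endpoint, data=payload,
#                    headers={"Authorization": f"Bearer {token}"})
```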

Thank you for your answer @Lakshay 🙂

I am aware of the "Full Refresh" option, but I wasn't considering it because I didn't think it could solve all my issues at once. My understanding is that it handles updates and additions (e.g. changes in columns), since it overwrites all the tables and metadata already in place and reprocesses all the files in the cloud_files() directory.

On the other hand, my concern is that this option can't cover my potential need to completely delete some of the tables along with their metadata, unless "full refresh" means that when I remove a table's definition from my pipeline code, the table and its related metadata are also removed from the target directory and schema.
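For the complete-deletion case, one approach is to drop the no-longer-defined tables explicitly. The sketch below only generates the DROP statements; the schema and table names are hypothetical placeholders, and inside a Databricks notebook each statement would be executed with spark.sql (commented out here so the snippet stands alone):

```python
# Hypothetical target schema and table names -- substitute your own.
target_schema = "my_dlt_target"
tables_to_remove = ["bronze_events", "silver_events"]

# Build one DROP statement per obsolete table.
# IF EXISTS makes the cleanup safe to rerun.
drop_statements = [
    f"DROP TABLE IF EXISTS {target_schema}.{name}" for name in tables_to_remove
]

for stmt in drop_statements:
    print(stmt)
    # spark.sql(stmt)  # uncomment inside a Databricks notebook
```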
