Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Vacuum on external tables that we mount on ADLS

ravikanthranjit
New Contributor III

Want to know the best process for removing files on ADLS after OPTIMIZE and a VACUUM dry run have completed.

6 REPLIES

ravikanthranjit
New Contributor III

Credit to the community member whose file-existence check code I borrowed.

ravikanthranjit
New Contributor III

I'd like the community's feedback on the code below. It works for a single specified table, and it can be parameterized and run.

But is this the best way to delete unwanted files of Delta tables that are stored externally in ADLS? Please let me know.

def file_exists_delete(path):
    """Delete the file at `path` if it exists; return True when a file was removed."""
    try:
        dbutils.fs.ls(path)  # raises FileNotFoundException when the path does not exist
        dbutils.fs.rm(path)
        print('removed the file ' + path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

# Copy into a separate cell.
# Note: RETAIN 0 HOURS only runs if spark.databricks.delta.retentionDurationCheck.enabled
# is set to false, and it is unsafe while the table has concurrent readers or writers.
spark.sql("OPTIMIZE tbl_name")
df = spark.sql("VACUUM tbl_name RETAIN 0 HOURS DRY RUN")

# Copy into a separate cell.
df_collect = df.collect()

# Copy into a separate cell and execute.
# Each row of the DRY RUN output contains one candidate file path.
for row in df_collect:
    file_exists_delete(row[0])
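
For the parameterization mentioned above, a minimal sketch using Databricks notebook widgets (the widget name table_name and its default value here are just an illustration):

# Hypothetical parameterization via a notebook widget.
dbutils.widgets.text("table_name", "tbl_name")
tbl = dbutils.widgets.get("table_name")

spark.sql(f"OPTIMIZE {tbl}")
dry_run = spark.sql(f"VACUUM {tbl} RETAIN 0 HOURS DRY RUN")
for row in dry_run.collect():
    file_exists_delete(row[0])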

-werners-
Esteemed Contributor III

Do not remove files from Delta Lake tables manually; that is why VACUUM exists.

Manual deletion can lead to a corrupt table.

Why not just run VACUUM without the dry run?

-werners-
Esteemed Contributor III

VACUUM (without the DRY RUN option) will actually remove the unused files, depending on the retention interval.

Check this topic.
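
For example (table name carried over from the question above; 168 hours is the Delta Lake default retention, used purely as an illustration):

# Removes files no longer referenced by the table and older than the retention window.
spark.sql("VACUUM tbl_name RETAIN 168 HOURS")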

Hubert-Dudek
Esteemed Contributor III

If you have external Delta files, you can clean them from Python by pointing the Delta API at the table path:

from delta.tables import DeltaTable

# pathToTable is the ADLS location of the external Delta table,
# e.g. "abfss://<container>@<account>.dfs.core.windows.net/path/to/table"
deltaTable = DeltaTable.forPath(spark, pathToTable)

deltaTable.vacuum()
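
vacuum() also accepts a retention threshold in hours if you want something other than the default:

deltaTable.vacuum(168)  # retain files needed by versions from the last 168 hours (the default window)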

Anonymous
Not applicable

Hi @Ravikanth Narayanabhatla

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
