Vacuum on external tables that we mount on ADLS

ravikanthranjit
New Contributor III

I want to know the best process for removing files on ADLS after OPTIMIZE and a VACUUM dry run have completed.

6 REPLIES

ravikanthranjit
New Contributor III

Credit to the community member from whom I took the file-existence code.

ravikanthranjit
New Contributor III

I'd like feedback from community members on the code below. It works for the specific table that is specified, and it can be parameterized and run.

But is this the best way to manage (i.e., delete) unwanted files of Delta tables that are stored externally on ADLS? Please let me know.

def file_exists_delete(path):
    """Delete the file at `path` if it exists; return False if it does not."""
    try:
        dbutils.fs.ls(path)   # raises FileNotFoundException if the path is missing
        dbutils.fs.rm(path)   # remove the file
        print('removed the file ' + path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise


# Copy into a separate cell
spark.sql("OPTIMIZE tbl_name")
df = spark.sql("VACUUM tbl_name RETAIN 0 HOURS DRY RUN")


# Copy into a separate cell
df_collect = df.collect()

# Copy into a separate cell and execute
for row in df_collect:
    file_exists_delete(row[0])   # each row of the dry-run result holds the path of a removable file

-werners-
Esteemed Contributor III

Do not remove files from Delta Lake tables manually; that is what VACUUM exists for.

Removing them manually can lead to a corrupt table.

Why not just run VACUUM without the dry run?
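
A minimal sketch of that suggestion, reusing the tbl_name from the post above: VACUUM deletes the stale files itself, so no manual dbutils.fs.rm loop is needed. Note that RETAIN 0 HOURS, as in the original code, requires disabling spark.databricks.delta.retentionDurationCheck.enabled and can break concurrent readers and time travel, so the default 7-day retention is shown here instead.

# Compact the table, then let VACUUM delete files that are no longer
# referenced by the Delta log and are older than the retention threshold.
spark.sql("OPTIMIZE tbl_name")
spark.sql("VACUUM tbl_name RETAIN 168 HOURS")   # 168 hours = the 7-day default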

-werners-
Esteemed Contributor III

VACUUM without the DRY RUN option will actually remove the unused files, depending on the retention interval.

check this topic
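
As a sketch of how that retention interval can be tuned per table (again assuming the tbl_name from the original post), the Delta table property delta.deletedFileRetentionDuration controls how long removed data files are kept before VACUUM may delete them:

# Keep removed data files for 7 days before VACUUM is allowed to delete them.
spark.sql("""
    ALTER TABLE tbl_name
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
""")

# With no explicit RETAIN clause, VACUUM uses the table's retention setting.
spark.sql("VACUUM tbl_name")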

Hubert-Dudek
Esteemed Contributor III

If you have external Delta files, you can use the Python API to clean them up via the table path:

from delta.tables import DeltaTable

# Load the Delta table directly from its storage path (works for external tables).
deltaTable = DeltaTable.forPath(spark, pathToTable)

# Remove files no longer referenced by the table, using the default retention.
deltaTable.vacuum()
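
A short usage sketch of the above: the path below is hypothetical (substitute your own mount point or abfss URI), and vacuum() also accepts a retention threshold in hours if you want something other than the default:

from delta.tables import DeltaTable

# Hypothetical external table location on ADLS; adjust to your own container/path.
pathToTable = "abfss://container@storageaccount.dfs.core.windows.net/delta/tbl_name"

deltaTable = DeltaTable.forPath(spark, pathToTable)
deltaTable.vacuum(168)   # delete unreferenced files older than 168 hours (7 days)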

Anonymous
Not applicable

Hi @Ravikanth Narayanabhatla

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
