Databricks Community

ravikanthranjit · ‎10-12-2022

What is the best way to remove files on ADLS after OPTIMIZE and a VACUUM dry run have completed?

ravikanthranjit · ‎10-12-2022

Credit to the community member whose file-existence check I reused in my code.

ravikanthranjit · ‎10-12-2022

I would like the community's feedback on the code below. It works for the one table that is specified, and it can be parameterized and run for other tables.

Is this the best way to delete unwanted files of Delta tables that are stored externally in ADLS? Please let me know.

def file_exists_delete(path):
    # Remove the file at `path` if it exists; return False if it is already gone.
    try:
        dbutils.fs.ls(path)  # raises if the path does not exist
        dbutils.fs.rm(path)
        print('removed the file ' + path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

# Copy into a separate cell
spark.sql("OPTIMIZE tbl_name")
# DRY RUN only lists the files VACUUM would delete; it removes nothing
df = spark.sql("VACUUM tbl_name RETAIN 0 HOURS DRY RUN")

# Copy into a separate cell
df_collect = df.collect()

# Copy into a separate cell and execute
for row in df_collect:
    file_exists_delete(row[0])

-werners- · ‎10-13-2022

Do not remove files from Delta Lake tables manually. That is what VACUUM exists for.

Deleting the files yourself can lead to a corrupt table.

Why not just run VACUUM without the dry run?
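As a rough illustration of that suggestion, here is a minimal sketch that builds the VACUUM statement with or without DRY RUN and hands it to Spark. The helper names `build_vacuum_sql` and `run_vacuum` are hypothetical (they are not part of any Databricks API), and `spark` is assumed to be the session a Databricks notebook already provides.

```python
def build_vacuum_sql(table_name, retain_hours=None, dry_run=False):
    """Build a Delta Lake VACUUM statement (hypothetical helper)."""
    sql = f"VACUUM {table_name}"
    if retain_hours is not None:
        sql += f" RETAIN {retain_hours} HOURS"
    if dry_run:
        sql += " DRY RUN"  # list candidate files only, delete nothing
    return sql


def run_vacuum(spark, table_name, retain_hours=None):
    """Run the real VACUUM (no DRY RUN) so Delta itself deletes the files."""
    return spark.sql(build_vacuum_sql(table_name, retain_hours))


# Inspect the statements without touching a cluster
print(build_vacuum_sql("tbl_name"))                                  # VACUUM tbl_name
print(build_vacuum_sql("tbl_name", retain_hours=168, dry_run=True))  # VACUUM tbl_name RETAIN 168 HOURS DRY RUN
```

In a notebook the call would then be `run_vacuum(spark, "tbl_name")`, which leaves the choice of files to delete to Delta rather than to a manual loop.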

-werners- · ‎10-13-2022

Without the DRY RUN option, VACUUM actually removes the files that are no longer used, depending on the retention interval.

Check this topic.
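To make the retention interval concrete: Delta's default retention for deleted files is 7 days, and a VACUUM with a shorter RETAIN value is normally rejected unless the safety check (`spark.databricks.delta.retentionDurationCheck.enabled`) is turned off. The sketch below is only a guard around that idea. The helper names are hypothetical, and it assumes the table uses the default retention rather than a custom table property.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # assumption: table uses Delta's default of 7 days


def is_below_default_retention(retain_hours, default_hours=DEFAULT_RETENTION_HOURS):
    """Return True if the requested retention is shorter than the default."""
    return retain_hours < default_hours


def vacuum_with_retention(spark, table_name, retain_hours):
    """Run VACUUM only when the retention is not shorter than the default."""
    if is_below_default_retention(retain_hours):
        # A short retention can delete files that readers or time travel still need.
        raise ValueError(
            f"RETAIN {retain_hours} HOURS is below the default of "
            f"{DEFAULT_RETENTION_HOURS} hours; refusing to vacuum."
        )
    return spark.sql(f"VACUUM {table_name} RETAIN {retain_hours} HOURS")
```

With this guard, `RETAIN 0 HOURS` is refused up front, while `vacuum_with_retention(spark, "tbl_name", 168)` runs a regular VACUUM.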

Hubert-Dudek · ‎10-16-2022

If you have external Delta tables, you can clean them up by path using the Python API:

from delta.tables import DeltaTable

# pathToTable is the storage path of the external Delta table
deltaTable = DeltaTable.forPath(spark, pathToTable)
deltaTable.vacuum()
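The same call also accepts an explicit retention in hours. Below is a small sketch of a wrapper that takes an existing `DeltaTable` handle, so it can be reused for several paths. The wrapper name `vacuum_delta_table` is hypothetical, and the commented usage assumes a cluster with the `delta` package available.

```python
def vacuum_delta_table(delta_table, retention_hours=None):
    """Vacuum a DeltaTable handle, optionally with an explicit retention."""
    if retention_hours is None:
        # Falls back to the table's configured retention interval
        return delta_table.vacuum()
    return delta_table.vacuum(float(retention_hours))


# Usage on a cluster (not executed here):
# from delta.tables import DeltaTable
# vacuum_delta_table(DeltaTable.forPath(spark, pathToTable), retention_hours=168)
```

Passing the handle in, rather than creating it inside the function, keeps the wrapper independent of how the table is located (by path or by name).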

Anonymous · ‎11-19-2022

Hi @Ravikanth Narayanabhatla

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!