Hi @Kushal Saha,
You can use the Azure Databricks utility function
dbutils.fs.rm
This function leverages the native cloud storage file system API, which is optimized for all file operations.
However, you can't delete a gigantic table directly using
dbutils.fs.rm("path/to/the/table")
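Even with the recursive flag set, that single call runs as one non-distributed, driver-side operation and can take a very long time or time out on a table with millions of files. For illustration (the mount path below is hypothetical):

// Direct recursive delete of the whole table directory from the driver.
// Fine for small directories, but it does not scale to a huge table.
dbutils.fs.rm("dbfs:/mnt/path/table", true)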
For smaller tables, the collected paths of the files to delete fit into the driver memory, so you can use a Spark job to distribute the file deletion task.
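As a minimal sketch of that smaller-table case (assuming a Databricks notebook where spark, dbutils, and the spark.implicits._ conversions are already available, and a hypothetical table path), mirroring the same pattern the delete() helper below uses:

// Collect the file paths on the driver (they fit in memory for a small table),
// then distribute the actual rm calls across the executors.
val files = dbutils.fs.ls("dbfs:/mnt/path/small_table/").map(_.path)
files.toDF("path").foreach { row =>
  dbutils.fs.rm(row.getString(0), true)
}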
For gigantic tables, even for a single top-level partition, the string representations of the file paths cannot fit into the driver memory.
The easiest way to solve this problem is to collect the paths of the inner partitions recursively, list the paths, and delete them in parallel.
import scala.util.{Try, Success, Failure}

// Deletes every file directly under path `p`, distributing the rm calls
// across the executors via a small DataFrame of file paths.
def delete(p: String): Unit = {
  dbutils.fs.ls(p).map(_.path).toDF.foreach { file =>
    dbutils.fs.rm(file(0).toString, true)
    println(s"deleted file: $file")
  }
}

// Walks the partition tree down to the given `level` before deleting, so each
// delete only has to handle a directory small enough to list comfortably.
final def walkDelete(root: String)(level: Int): Unit = {
  dbutils.fs.ls(root).map(_.path).foreach { p =>
    println(s"Deleting: $p, on level: ${level}")
    val deleting = Try {
      if (level == 0) delete(p)
      else if (p.endsWith("/")) walkDelete(p)(level - 1)
      //
      // Set only n levels of recursion, so it won't be a problem
      //
      else delete(p)
    }
    deleting match {
      case Success(v) =>
        println(s"Successfully deleted $p")
        dbutils.fs.rm(p, true)
      case Failure(e) => println(e.getMessage)
    }
  }
}
The code deletes inner partitions while ensuring that each partition being deleted is small enough to handle. It does this by walking the partition tree recursively, level by level, and only starts deleting once it reaches the level you set.
For instance, if you want to start by deleting the top-level partitions, use
walkDelete(root)(0).
Spark will delete all the files under dbfs:/mnt/path/table/a=1/, then delete .../a=2/, following the pattern until it is exhausted.
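A minimal usage sketch, assuming the functions above are defined in a Databricks notebook and the mount path is hypothetical:

// Start deleting at the top-level partitions; pass a deeper level,
// for example walkDelete(root)(1), to recurse one level further before deleting.
val root = "dbfs:/mnt/path/table/"
walkDelete(root)(0)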
Source