Delete the file

Kush22
New Contributor

While exporting data from Databricks to Azure Blob Storage, how can I delete the committed, started, and success files?

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @Kushal Saha,

You can use the Azure Databricks utility function dbutils.fs.rm, which leverages the native cloud storage file system API and is optimized for file operations.
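
For the small marker files from the question, one option is simply to list the export directory and remove just those entries. A minimal sketch, assuming a hypothetical output path and the usual _started_*, _committed_* and _SUCCESS file names:

// Hypothetical export location; replace with your container and path.
val outputPath = "wasbs://<container>@<account>.blob.core.windows.net/export/"

// Remove only the commit-protocol marker files, keeping the data files.
dbutils.fs.ls(outputPath)
  .filter(f => f.name.startsWith("_committed") ||
               f.name.startsWith("_started")   ||
               f.name == "_SUCCESS")
  .foreach(f => dbutils.fs.rm(f.path))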

However, you can't delete a gigantic table directly using dbutils.fs.rm("path/to/the/table").

For smaller tables, the collected paths of the files to delete fit into the driver memory, so you can use a Spark job to distribute the file deletion task.

For gigantic tables, even for a single top-level partition, the string representations of the file paths cannot fit into the driver memory.

The easiest way to solve this problem is to collect the paths of the inner partitions recursively, list the paths, and delete them in parallel.

import scala.util.{Try, Success, Failure}
import spark.implicits._  // for .toDF on the path listing
 
// Delete every entry directly under `p`. The listing is turned into a DataFrame
// so the deletes are distributed across the executors instead of being run one
// by one on the driver.
def delete(p: String): Unit = {
  dbutils.fs.ls(p).map(_.path).toDF.foreach { file =>
    dbutils.fs.rm(file(0).toString, true)
    println(s"deleted file: $file")
  }
}
 
// Walk `root` recursively, but only `level` levels deep, so each call to `delete`
// only ever has to list a partition small enough to fit in driver memory.
final def walkDelete(root: String)(level: Int): Unit = {
  dbutils.fs.ls(root).map(_.path).foreach { p =>
    println(s"Deleting: $p, on level: ${level}")
    val deleting = Try {
      if (level == 0) delete(p)
      else if (p.endsWith("/")) walkDelete(p)(level - 1)  // recurse into the sub-partition
      else delete(p)                                      // plain file at this level
    }
    deleting match {
      case Success(_) =>
        println(s"Successfully deleted $p")
        dbutils.fs.rm(p, true)  // remove the now-empty directory itself
      case Failure(e) => println(e.getMessage)
    }
  }
}

The code deletes inner partitions while ensuring that each partition being deleted is small enough: it walks the partition tree level by level and only starts deleting once it reaches the level you set.

For instance, if you want to start by deleting the top-level partitions, use walkDelete(root)(0).

Spark will delete all the files under dbfs:/mnt/path/table/a=1/, then delete .../a=2/, following the pattern until it is exhausted.
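
A short usage sketch, reusing the example mount path from above (the partition layout is assumed):

val root = "dbfs:/mnt/path/table/"

// Level 0: delete each top-level partition (a=1/, a=2/, ...) as a whole.
walkDelete(root)(0)

// If a single top-level partition is still too large to list on the driver,
// recurse one level deeper before deleting:
// walkDelete(root)(1)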

Source


2 REPLIES


Kaniz
Community Manager

Hi @Kushal Saha, how are you doing? Were you able to resolve your query? Did my suggestion help you?