01-12-2024 06:20 PM - edited 01-12-2024 06:22 PM
I need to delete 50 TB of data out of DBFS storage. It is overpartitioned and dbutils does not work. Limiting partition size and iterating over the data to delete it doesn't work either. Azure locks access to the storage through the managed resource group permissions, and even though I'm the owner and admin of the Azure subscription I still can't grant myself access.
I've isolated the problem to a single hive_metastore catalog that was deleted, yet the directory and data for two "tables" still remain. It has been over 30 days and the data was not cleaned up even though the catalog is deleted. This extra data has increased our storage costs significantly.
https://community.databricks.com/t5/data-engineering/how-do-i-delete-files-from-the-dbfs/td-p/29185 doesn't work
https://stackoverflow.com/questions/54091812/remove-files-from-directory-after-uploading-in-databric... doesn't work
https://stackoverflow.com/questions/54081662/how-to-delete-all-files-from-folder-with-databricks-dbu... doesn't work
https://docs.databricks.com/api/workspace/dbfs/delete doesn't work, and it says "when you delete a large number of files, the delete operation is done in increments."
https://kb.databricks.com/data/list-delete-files-faster works, but it is so slow it would take months and months to run through this data, which is extremely overpartitioned.
In regular Azure blob storage that isn't auto-created and managed by Databricks, it is very easy to delete data: click on the container, directory, or file, then click delete and it is gone.
How do I just delete a massive amount of data in databricks storage?
Thanks in advance.
01-13-2024 02:45 PM
@dmart - If it is overpartitioned, the easiest way to solve this is to collect the paths of the inner partitions recursively, list them, and delete them in parallel.
01-14-2024 12:24 PM - edited 01-14-2024 12:38 PM
@shan_chandra - given the information I provided, did I not already try this?
Maybe I’m not understanding your response, but isn’t this just iterating the delete operation through the data? This is not what I’m asking for, so maybe I should have been clearer.
Do you think your response actually answers my question?
If yes, then please provide some example code.
01-16-2024 07:25 AM
@dmart - Let's take a step back.
If the file path you want to delete is as follows
/mnt/path/table
and the internal partitioned structure is
/mnt/path/table/date=2022-01-03
/mnt/path/table/date=2022-01-04
Sample code to recursively list the files and delete them:
def recursive_ls(path):
    result = dbutils.fs.ls(path)
    files = [f for f in result if f.isFile()]
    directories = [d for d in result if d.isDir()]
    for d in directories:
        files += recursive_ls(d.path)
    return files

files = recursive_ls("<parent directory path>")
for file in files:
    print(file.path)
    dbutils.fs.rm(file.path)
If the two tables listed are Delta tables, could you please check whether any properties related to deleted file retention, etc. are set on them by running DESCRIBE EXTENDED?
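For reference, here is a minimal sketch of that check. It assumes the leftover directories still contain a valid _delta_log, and the paths are hypothetical; since the catalog entry is gone, the path-based delta.`...` form is used instead of a table name:
# Hypothetical paths to the two leftover table directories.
paths = ["/mnt/path/table_a", "/mnt/path/table_b"]

for p in paths:
    # Works even if the table is no longer registered in the metastore,
    # as long as the directory still holds a valid _delta_log.
    spark.sql(f"DESCRIBE EXTENDED delta.`{p}`").show(truncate=False)
    # Retention-related properties (if any) appear here, e.g.
    # delta.deletedFileRetentionDuration or delta.logRetentionDuration.
    spark.sql(f"SHOW TBLPROPERTIES delta.`{p}`").show(truncate=False)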
01-16-2024 10:40 AM
I've tried collecting the paths of the inner partitions recursively, listing the paths, and deleting them in parallel (https://kb.databricks.com/data/list-delete-files-faster is the online example). I get it to run, but the delete is so slow that it will take months and months even when I scale up compute. I tried your code, but the results are the same: dbutils.fs.ls(path) exhausts memory.
It is hard to believe this is the easiest way to delete a massive amount of messy data from Azure storage in the Databricks-managed resource. In regular Azure storage not managed by Databricks, this would be extremely easy to delete directly in the Azure portal.
If there isn't an answer, then we will literally have to migrate to a completely new, separate Databricks resource and then delete this whole Databricks resource in order to delete the storage and eliminate the unnecessary costs. We have legacy processes that still run on this resource, so this would take a lot of work. This is a poor solution path for just deleting data.
01-16-2024 10:56 AM
Is there a way to just delete the folder and everything inside?
This is possible in Azure storage, but I don't have direct access to the data because the auto-created Databricks resource group locks me out even though I am owner and admin of the Azure subscription.
This is how easy it is in Azure Storage where I have access.
Databricks data is stored in Azure storage, so there is no reason this should not be possible. If it is not possible in Databricks, then how do I gain access to the raw data in Azure storage and delete it there? In other words, how do I get around the resource group restrictions as an Azure subscription owner and admin?
01-16-2024 11:00 AM - edited 01-16-2024 11:01 AM
@dmart - The delete operation is driver-only, so we need a beefy driver. In order to continue supporting this, can you please create a support case through Azure support? Rest assured, you will be in the best technical hands to resolve this.
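For context, since dbutils.fs calls run on the driver, one way to make better use of a large driver is to fan the recursive deletes out across driver threads. This is only a sketch under assumptions (hypothetical root path, thread count, and a layout where each top-level entry can safely be removed recursively); storage-side throttling may still cap throughput:
from concurrent.futures import ThreadPoolExecutor

root = "/mnt/path/table"  # hypothetical root directory to clear out

def rm_recursive(path):
    # Recursively remove one top-level partition directory.
    dbutils.fs.rm(path, True)
    return path

# List only the first level; each entry is deleted recursively by a thread.
top_level = [f.path for f in dbutils.fs.ls(root)]

with ThreadPoolExecutor(max_workers=32) as pool:
    for done in pool.map(rm_recursive, top_level):
        print(f"deleted: {done}")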
01-16-2024 11:43 AM
Ok thank you. I'll try that and give an update here.
01-26-2024 06:41 AM
I've been working with Azure support for a week now and they haven't solved the problem. They keep trying to do it through Databricks and have not been successful.
It appears there is no solution. The only option is to delete the entire workspace.
01-26-2024 07:17 AM - edited 01-26-2024 07:22 AM
Azure support can't solve the problem.
It appears there is no solution and the answer to my question is "it can't be deleted". The only option is to delete the entire workspace.
This definitely is NOT a solution to my problem, or even what I was asking for in this post, but it is the closest I could get. It actually starts deleting data, but at a pace that will take months or years to finish given the amount of data I need deleted.
CELL_1
%scala
// Collect the paths of the inner partitions recursively, list the paths, and delete them in parallel
// (limit the depth walked per pass so the listing does not exhaust driver memory).
import scala.util.{Try, Success, Failure}

def delete(p: String): Unit = {
  dbutils.fs.ls(p).map(_.path).toDF.foreach { file =>
    dbutils.fs.rm(file(0).toString, true)
    println(s"deleted file: $file")
  }
}

final def walkDelete(root: String)(level: Int): Unit = {
  dbutils.fs.ls(root).map(_.path).foreach { p =>
    println(s"Deleting: $p, on level: ${level}")
    val deleting = Try {
      if (level == 0) delete(p)
      else if (p endsWith "/") walkDelete(p)(level - 1)
      else delete(p)
    }
    deleting match {
      case Success(v) =>
        println(s"Successfully deleted $p")
        dbutils.fs.rm(p, true)
      case Failure(e) => println(e.getMessage)
    }
  }
}

CELL_2
%scala
val root = "/path/to/data"
walkDelete(root)(0)
01-26-2024 08:09 AM
@dmart - Can you please try with a beefy driver, list out the root paths (e.g. "/path/to/data1", "/path/to/data2", ...), and run the above code on multiple clusters, changing the root path in the script for each cluster? Please let us know if this works for you.
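For illustration only, a small Python sketch (hypothetical root path and cluster count) of listing the top-level directories and splitting them round-robin into slices, one slice of root paths per cluster running the script above; note that listing a directory with a very large number of children can itself be slow:
# Hypothetical parent directory and number of clusters available.
root = "/path/to/data1"
num_clusters = 4

top_level = sorted(f.path for f in dbutils.fs.ls(root))

# Round-robin the top-level directories into one slice per cluster.
slices = [top_level[i::num_clusters] for i in range(num_clusters)]

for i, paths in enumerate(slices):
    print(f"cluster {i}: {len(paths)} root paths, e.g. {paths[:3]}")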
01-26-2024 11:29 AM
@shan_chandra - the root path is "/path/to/data1/", then hundreds of millions of folders, and every one of those contains 100 or more folders, then data files.
I tried this config and it is still slow. What would you consider a beefy driver?
Thanks
03-20-2024 02:00 PM
For anyone else with this issue: there is no solution other than deleting the whole Databricks workspace, which then deletes all the resources locked up in the managed resource group. The data could not be deleted in any other way, not even by Microsoft leveraging their highest level of Azure support service and escalating across their teams.