Is there a way to use wildcard expressions inside dbutils functions?

Ramana
Contributor

I am working on a migration project where I am replacing the HDFS commands I currently execute inside my Python code via the os.system() function with dbutils functions.

The dbutils functions work as expected if I pass the fully qualified path of a file, but they do not work when I try to pass a wildcard.

The current project has multiple HDFS commands (cp, rm, mv, etc.) with wildcard expressions.

Currently, I see two ways to mitigate the issue: one is to list all the objects, apply filters to that list, and then loop through it to complete the operation, though this is not efficient compared to bulk copy/move/remove commands; the other is to use the boto3 utilities to do the operation.
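For illustration, the first workaround might look roughly like this (a sketch only; fnmatch emulates the shell wildcard, and the bucket and paths are placeholders):

from fnmatch import fnmatch

# dbutils is available in Databricks notebooks without an import
src = "s3a://<bucket>/folder1/"
dst = "s3a://<bucket>/folder2/"
pattern = "some_prefix*.csv"

for f in dbutils.fs.ls(src):
    if fnmatch(f.name, pattern):
        dbutils.fs.cp(f.path, dst + f.name)  # or dbutils.fs.mv / dbutils.fs.rm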

Is there a way in Databricks to do a bulk copy/move/remove of files from one S3 folder to another by using wildcard expressions?

Here are the example commands:

hdfs dfs -cp -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/

hdfs dfs -mv -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/

hdfs dfs -rm -r -skipTrash s3a://<bucket>/folder1/some_prefix*.csv

Here are the exceptions I got while trying dbutils.fs.ls():

dbutils.fs.ls("s3a://<bucket>/folder1/some_prefix*.csv") --> java.io.FileNotFoundException: No such file or directory
dbutils.fs.ls("s3a://<bucket>/folder1/some_prefix\*\.csv") --> java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED:
 
If I execute dbutils.fs.ls("s3a://<bucket>/folder1/"), then there are no issues.
 
I feel like there is no support for wildcard expressions in dbutils functions, but I would like to know whether anyone has already had this problem and what alternatives people used to mitigate it.
 
Thanks
Ramana

 

5 REPLIES

Hemant
Valued Contributor II

Hi @Ramana, if you create a mount point for the S3 bucket in Databricks, you can leverage the functionality of the glob and os Python modules.

Suppose your mount point is "/mnt/s3"; just change it to "/dbfs/mnt/s3" and use the glob and os modules, as sketched below.
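Something along these lines (a sketch, assuming the bucket is mounted at /mnt/s3 and therefore reachable at /dbfs/mnt/s3 through the FUSE path; folder names are placeholders):

import glob
import os
import shutil

# Copy every matching file from folder1 to folder2 through the /dbfs FUSE path
for src in glob.glob("/dbfs/mnt/s3/folder1/some_prefix*.csv"):
    shutil.copy(src, os.path.join("/dbfs/mnt/s3/folder2", os.path.basename(src)))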


Hope this will help you.

Thanks,

Hemant Soni

The question is not about accessing S3 inside Databricks; it is about using wildcard expressions to filter and batch (bulk) the file operations.

 

FYI: we have a mounted S3 bucket as well as an external S3 location, and we would like to do these operations on the external S3 location.

 

Anonymous
Not applicable

Hi @Ramana,

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Ramana
Contributor

The option you suggested is just an alternative to the boto3 API, but it is not related to the wildcard capability I need.

Even if I mount the S3 bucket, I am still not able to do wildcard operations in Databricks.

The options I have are either to use boto3 or to capture the dbutils.fs.ls output as a list, iterate through it, and do the necessary operations. But these are not part of dbutils. I feel like dbutils only supports operations at the folder level or at the single-file level. My requirement is to copy/delete/move multiple files (a bulk operation) by filtering on prefixes and suffixes, and currently I am not able to do that in Databricks with dbutils.
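For reference, a boto3-based bulk move might look roughly like this (a sketch; the bucket name and prefixes are placeholders, and S3 prefix filtering plus a suffix check stands in for the wildcard):

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # placeholder bucket name

# Emulate "mv folder1/some_prefix*.csv folder2/": copy each matching object, then delete it
for obj in bucket.objects.filter(Prefix="folder1/some_prefix"):
    if obj.key.endswith(".csv"):
        target_key = "folder2/" + obj.key.split("/")[-1]
        bucket.copy({"Bucket": bucket.name, "Key": obj.key}, target_key)
        obj.delete()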

Ramana
Contributor

As I mentioned in my problem statement, I have already implemented the required functionality with alternative approaches (the AWS S3 API and the boto3 API).

Still, it remains an outstanding issue with dbutils.