I am working on a migration project to replace the HDFS commands I currently execute from my Python code via os.system() with dbutils functions.
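For context, the current calls look roughly like this (a minimal sketch of the pattern, not the exact production code; the actual commands are listed further below):

import os

# current approach: shell out to the HDFS CLI, which expands the wildcard against S3 itself
os.system("hdfs dfs -cp -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/")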
The dbutils functions work as expected if I pass the fully qualified path of a file, but they do not work when I try to pass a wildcard.
The current project has multiple HDFS commands (cp, rm, mv, etc.) that use wildcard expressions.
Currently, I see two ways to mitigate the issue: one, list all the objects, apply filters on that list, and then loop through it to complete the operation, but this is not efficient compared to bulk copy/move/remove commands; two, use boto3 to do the operation. Rough sketches of both are below.
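A minimal sketch of option one (the folder names and pattern are just the placeholders from my examples): expand the wildcard myself by listing the source folder with dbutils.fs.ls(), filtering with fnmatch, and then looping per file.

import fnmatch

src_dir = "s3a://<bucket>/folder1/"
dst_dir = "s3a://<bucket>/folder2/"
pattern = "some_prefix*.csv"

# dbutils is available implicitly in Databricks notebooks
matches = [f.path for f in dbutils.fs.ls(src_dir) if fnmatch.fnmatch(f.name, pattern)]

for path in matches:
    # same idea works with dbutils.fs.mv() and dbutils.fs.rm()
    dbutils.fs.cp(path, dst_dir + path.split("/")[-1])

And a sketch of option two with boto3 (again, bucket and key names are placeholders). Note that S3 itself has no wildcard or bulk copy API either, so this still issues one request per matching object:

import fnmatch
import boto3

s3 = boto3.resource("s3")
bucket_name = "<bucket>"

for obj in s3.Bucket(bucket_name).objects.filter(Prefix="folder1/some_prefix"):
    name = obj.key.split("/")[-1]
    if fnmatch.fnmatch(name, "some_prefix*.csv"):
        # server-side copy; optionally delete the source to emulate a move
        s3.Object(bucket_name, "folder2/" + name).copy_from(
            CopySource={"Bucket": bucket_name, "Key": obj.key})
        # obj.delete()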
Is there a way in Databricks to do bulk copy/move/remove of files from one S3 folder to another using wildcard expressions?
Here are the example commands:
hdfs dfs -cp -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/
hdfs dfs -mv -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/
hdfs dfs -rm -r -skipTrash s3a://<bucket>/folder1/some_prefix*.csv
Here are the exceptions I got while trying dbutils.fs.ls():
dbutils.fs.ls("s3a://<bucket>/folder1/some_prefix*.csv") --> java.io.FileNotFoundException: No such file or directory
dbutils.fs.ls("s3a://<bucket>/folder1/some_prefix\*\.csv") --> java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED:
If I execute dbutils.fs.ls("s3a://<bucket>/folder1/"), then there are no issues.
I suspect there is no support for wildcard expressions in the dbutils functions, but I would like to hear whether anyone has already run into this problem and what alternatives people used to mitigate it.
Thanks
Ramana