07-14-2023 01:37 PM
I am working on a migration project to replace the HDFS commands I currently run inside my Python code via the os.system() function with dbutils functions.
The dbutils functions work as expected when I pass the fully qualified path of a file, but they do not work when I pass a wildcard.
The current project has multiple HDFS commands (cp, rm, mv, etc.) with wildcard expressions.
Currently, I see two ways to mitigate the issue: one is to list all the objects, apply filters on that list, and then loop through the list to complete the operation, but this is not efficient compared to bulk copy/move/remove commands; the other is to use the boto3 utilities for these operations.
Is there a way in Databricks to bulk copy/move/remove files from one S3 folder to another using wildcard expressions?
Here are the example commands:
hdfs dfs -cp -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/
hdfs dfs -mv -f s3a://<bucket>/folder1/some_prefix*.csv s3a://<bucket>/folder2/
hdfs dfs -rm -r -skipTrash s3a://<bucket>/folder1/some_prefix*.csv
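Roughly, the direct dbutils equivalents would look like the sketch below (same placeholder paths as above); each of these fails as soon as the path contains a wildcard:
# Direct translations of the hdfs commands above. Each call works for a
# single fully qualified file but fails when the path contains "*".
dbutils.fs.cp("s3a://<bucket>/folder1/some_prefix*.csv", "s3a://<bucket>/folder2/")
dbutils.fs.mv("s3a://<bucket>/folder1/some_prefix*.csv", "s3a://<bucket>/folder2/")
dbutils.fs.rm("s3a://<bucket>/folder1/some_prefix*.csv", recurse=True)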
The following are the exceptions I got while trying dbutils.fs.ls():
07-14-2023 10:51 PM
Hi @Ramana, if you create a mount point for the S3 bucket in Databricks, it will let you leverage the functionality of the glob and os Python modules.
Suppose your mount point is "/mnt/s3"; just change it to '/dbfs/mnt/s3' and use the glob and os modules.
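For example, something along these lines (the mount name, folders, and prefix are placeholders, and shutil stands in for the copy step):
import glob
import os
import shutil

# The /dbfs FUSE mount exposes the S3 mount to normal Python file APIs,
# so glob can expand the wildcard.
matches = glob.glob("/dbfs/mnt/s3/folder1/some_prefix*.csv")

for path in matches:
    # Copy each matched file into the target folder; swap in shutil.move()
    # or os.remove() for the move/delete cases.
    shutil.copy(path, "/dbfs/mnt/s3/folder2/")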
Hope this will help you.
Thanks,
07-17-2023 07:19 AM - edited 07-17-2023 07:20 AM
The question is not about accessing S3 inside Databricks; it is about using wildcard expressions to filter and group (bulk) the file operations.
FYI: we have a mounted S3 bucket as well as an external S3 location, and we would like to perform these operations on the external S3 location.
07-15-2023 01:50 AM
Hi @Ramana,
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. Would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
07-17-2023 07:16 AM - edited 07-17-2023 07:24 AM
The option you suggested is just an alternative to the boto3 API, but it is not related to the wildcard capability I need.
Even if I mount the S3 bucket, I am still not able to do wildcard operations in Databricks.
The options I have are either to use boto3 or to capture the dbutils.fs.ls() output as a list, iterate through it, and perform the necessary operations, but neither of these is a dbutils feature. It seems dbutils only supports operations at the folder level or on a single file. My requirement is to copy/delete/move multiple files (a bulk operation) filtered by prefixes and suffixes, and currently I am not able to do that in Databricks with dbutils.
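For reference, the list-and-iterate workaround looks roughly like this (folder names and the pattern are placeholders); it works, but it is still a per-file loop rather than a bulk operation:
from fnmatch import fnmatch

src = "s3a://<bucket>/folder1/"
dst = "s3a://<bucket>/folder2/"

# List the source folder, filter client-side with the wildcard pattern,
# then operate on the matches one file at a time.
matches = [f.path for f in dbutils.fs.ls(src) if fnmatch(f.name, "some_prefix*.csv")]

for path in matches:
    dbutils.fs.cp(path, dst)   # or dbutils.fs.mv(path, dst)
    # dbutils.fs.rm(path)      # for the delete case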
09-18-2023 08:38 AM
As I mentioned in my problem statement, I have already implemented the required functionality with alternative approaches (the AWS S3 API and boto3).
Still, it is an outstanding issue with dbutils.
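For reference, a minimal boto3 sketch of the copy/move case (the bucket name, prefixes, and the move emulation are illustrative, not my exact code); S3 only filters server-side by prefix, so the suffix part of the pattern is checked client-side:
import boto3
from fnmatch import fnmatch

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # placeholder bucket name

# S3 listing can only be narrowed by prefix on the server side, so the
# full wildcard pattern is applied client-side with fnmatch.
for obj in bucket.objects.filter(Prefix="folder1/some_prefix"):
    if not fnmatch(obj.key, "folder1/some_prefix*.csv"):
        continue
    new_key = "folder2/" + obj.key.split("/")[-1]
    # Server-side copy into the target folder, then delete the source
    # object to emulate a move; drop the delete for a plain copy.
    bucket.copy({"Bucket": bucket.name, "Key": obj.key}, new_key)
    obj.delete()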