12-05-2022 12:19 AM
I am trying to parallelise file copies in Databricks. Making use of multiple executors is one way, so this is the piece of code that I wrote in PySpark.
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = dbutils.fs.ls(src_path)
    file_paths_df = spark.sparkContext.parallelize(files_in_path).toDF()
    file_paths_df.foreach(lambda x: dbutils.fs.cp(x.path, target_path, recurse=True))
I fetched all the files to copy and created a DataFrame. When I run a foreach on the DataFrame, I get the following error:
You cannot use dbutils within a spark job or otherwise pickle it.
If you need to use getArguments within a spark job, you have to get the argument before
using it in the job. For example, if you have the following code:
myRdd.map(lambda i: dbutils.args.getArgument("X") + str(i))
Then you should use it this way:
argX = dbutils.args.getArgument("X")
myRdd.map(lambda i: argX + str(i))
But when I try the same thing in Scala, it works perfectly, even though dbutils is used inside a Spark job there. Attaching that piece of code as well.
def parallel_copy_execution(p: String, t: String): Unit = {
  dbutils.fs.ls(p).map(_.path).toDF.foreach { file =>
    dbutils.fs.cp(file(0).toString, t, recurse = true)
    println(s"cp file: $file")
  }
}
Has the PySpark API not been updated to handle this?
If so, please suggest an alternative way to run the dbutils copy in parallel.
12-05-2022 12:30 AM
I think the PySpark API does not support this at the moment.
12-05-2022 12:59 AM
Thanks! Is there any other alternative for parallel processing?
04-16-2023 11:26 PM
@Nandini Raja Were you able to find a solution for this? We are trying to bulk copy files from S3 to ADLS Gen2, and dbutils being single-threaded is a pain. I even tried the Scala code that worked for you, but I get the error below:
Caused by: KeyProviderException: Failure to initialize configuration
Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
This may be because this conf is not available at the executor level.
Any solution you were able to figure out?
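For reference, a minimal sketch of making the account key available; the storage account, secret scope, and key names are placeholders (assumptions). A runtime spark.conf.set is usually enough for driver-side access, while executor-side access typically needs the same setting in the cluster's Spark config:
# Hypothetical names: replace <storage-account>, scope, and key with your own.
storage_account = "<storage-account>"
account_key = dbutils.secrets.get(scope="my-scope", key="adls-account-key")

# Session-level setting: generally sufficient for driver-side access (e.g. dbutils.fs).
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

# For executor-side access, the same key/value usually has to be set in the
# cluster's Spark config (Cluster > Advanced options > Spark config):
# fs.azure.account.key.<storage-account>.dfs.core.windows.net <value>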
12-05-2022 01:38 AM
Maybe we can try FileUtil.copy.
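For example, a rough sketch of driving Hadoop's FileUtil.copy from PySpark through the JVM gateway; the paths are placeholders, and this runs on the driver, not inside a Spark job:
# Sketch: copy one file with Hadoop's FileUtil via the py4j gateway.
hadoop = spark._jvm.org.apache.hadoop.fs
conf = spark._jsc.hadoopConfiguration()

src = hadoop.Path("dbfs:/mnt/source/some_file")   # placeholder path
dst = hadoop.Path("dbfs:/mnt/target/some_file")   # placeholder path

src_fs = src.getFileSystem(conf)
dst_fs = dst.getFileSystem(conf)

# copy(srcFS, src, dstFS, dst, deleteSource, conf)
hadoop.FileUtil.copy(src_fs, src, dst_fs, dst, False, conf)
Since this runs on the driver, it could be fanned out over many files with a concurrent.futures.ThreadPoolExecutor.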
12-05-2022 04:27 AM
Could you use delta clone to copy tables?
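If what is being copied is a Delta table rather than loose files, a deep clone could do the copy in one statement; the table names below are placeholders:
# Sketch: DEEP CLONE copies both the metadata and the data files of a Delta table.
spark.sql("""
  CREATE OR REPLACE TABLE target_db.my_table_copy
  DEEP CLONE source_db.my_table
""")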
12-05-2022 08:00 AM
Yes, I think the best approach would be to rebuild the code entirely and use, for example, COPY INTO.
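A minimal sketch of what that could look like, assuming the source files are in a loadable format (Parquet here) and the target Delta table already exists; the table name and path are placeholders:
# Sketch: COPY INTO idempotently loads new files into a Delta table.
spark.sql("""
  COPY INTO target_db.my_table
  FROM 'abfss://container@account.dfs.core.windows.net/landing/'
  FILEFORMAT = PARQUET
""")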
12-05-2022 08:01 AM
Alternatively, copy files using Azure Data Factory. It has great throughput.
12-05-2022 09:21 PM
You can't use dbutils commands inside the PySpark API. Try using S3 copy or its Azure equivalent instead.
01-10-2023 12:14 PM
@Nandini Raja I did something similar by using shutil instead of dbutils. This worked for copying many local files to Azure Storage in parallel. However, the issue I'm having now is finding a Unity Catalog-friendly solution, as mounting Azure Storage isn't recommended (shutil and os won't work with abfss:// paths).
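For what it's worth, the shutil approach was roughly this kind of driver-side thread pool; the paths are placeholders, and it assumes the storage is mounted so it is visible as a local /dbfs path:
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_one(src_file: str, target_dir: str) -> str:
    # shutil works only on local-style paths, e.g. /dbfs/mnt/..., not abfss://
    dst = Path(target_dir) / Path(src_file).name
    shutil.copy(src_file, dst)
    return str(dst)

# Placeholder mount points; list only regular files.
src_files = [str(p) for p in Path("/dbfs/mnt/source").glob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda f: copy_one(f, "/dbfs/mnt/target"), src_files))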
01-11-2023 02:33 AM
If you have a Spark session, you can use the underlying Hadoop FileSystem API:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._jvm.org.apache.hadoop.fs.Path
# List files
fs.listStatus(path("/path/to/data")) # Should work with mounted points
# Rename file
fs.rename(path("OriginalName"), path("NewName"))
# Delete file
fs.delete(path("/path/to/data"))
# Upload a local file to DBFS
fs.copyFromLocalFile(path(local_file_path), path(remote_file_path))
# Download a file from DBFS to the local filesystem
fs.copyToLocalFile(path(remote_file_path), path(local_file_path))
If you have an Azure Storage account, you should mount it to your cluster; then you can access it with either `abfss://` or `/mnt/` paths.
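If the path is an abfss:// URI rather than a mount, the same Hadoop API can still be used by resolving the FileSystem for that URI; a sketch assuming the cluster already has credentials configured for the storage account (the container and account names are placeholders):
# Sketch: get a FileSystem bound to an abfss:// location instead of DBFS.
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()

abfss_root = "abfss://container@account.dfs.core.windows.net/"   # placeholder
uri = jvm.java.net.URI.create(abfss_root)
abfss_fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, conf)

# Same operations as above, now against the Azure Storage container.
for status in abfss_fs.listStatus(jvm.org.apache.hadoop.fs.Path(abfss_root + "path/to/data")):
    print(status.getPath().toString())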