Data Engineering

Question: Decrypt many files with UDF

ggsmith
New Contributor III

I have around 20 PGP files in a folder in a volume that I need to decrypt. I have a decryption function that accepts a file name and writes the decrypted file to a new folder in the same volume.

I thought I could create a Spark DataFrame with the name of each file, wrap my decryption function in a UDF, and apply it to each row. I expected this to decrypt the files in parallel; they are very large, so decrypting them one at a time takes a long time.
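
A minimal sketch of that approach, assuming a placeholder decrypt_file helper and illustrative volume paths (not the actual code):

import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative paths; the real catalog, schema, and volume names differ.
SRC_DIR = "/Volumes/my_catalog/my_schema/my_volume/encrypted"
DST_DIR = "/Volumes/my_catalog/my_schema/my_volume/decrypted"

def decrypt_file(path: str) -> str:
    """Placeholder for the real PGP decryption routine."""
    out_path = os.path.join(DST_DIR, os.path.splitext(os.path.basename(path))[0])
    # ... decrypt `path` and write the plaintext to `out_path` ...
    return out_path

decrypt_udf = udf(decrypt_file, StringType())

files = [os.path.join(SRC_DIR, f) for f in os.listdir(SRC_DIR) if f.endswith(".pgp")]
df = spark.createDataFrame([(f,) for f in files], ["path"])

# Applying the UDF runs decrypt_file on the executors, which is where the
# PermissionError shows up.
df.withColumn("decrypted_path", decrypt_udf("path")).collect()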

The issue is that even though the file is there and I can run the process on an individual file without problems, when I apply the UDF to the DataFrame of file names I get a 'PermissionError: [Errno 13] Permission denied: "Path_to_file.pgp"' message.

Is this not something I can do with Spark?

1 ACCEPTED SOLUTION


Brahmareddy
Valued Contributor III

The error happens because the Spark worker nodes can't access the files the way your driver can. Instead of decrypting inside Spark, do it outside of Spark: use Python's multiprocessing, or a simple batch script, to run several decryptions in parallel. Another option is to move the files to storage every node can reach, such as HDFS or S3. Spark works best for processing data, not for handling files directly.
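
For example, a driver-side sketch along those lines using concurrent.futures from the standard library; the decrypt_file helper and volume paths are assumptions, not the original code:

import glob
import os
from concurrent.futures import ProcessPoolExecutor

# Illustrative paths; adjust to the real volume.
SRC_DIR = "/Volumes/my_catalog/my_schema/my_volume/encrypted"
DST_DIR = "/Volumes/my_catalog/my_schema/my_volume/decrypted"

def decrypt_file(path: str) -> str:
    """Placeholder for the real PGP decryption routine."""
    out_path = os.path.join(DST_DIR, os.path.splitext(os.path.basename(path))[0])
    # ... decrypt `path` and write the plaintext to `out_path` ...
    return out_path

if __name__ == "__main__":
    files = glob.glob(os.path.join(SRC_DIR, "*.pgp"))
    # Decrypt several files at once on the driver; tune max_workers to the
    # driver's CPU count and the size of the files. If decryption mostly waits
    # on I/O or an external gpg process, a ThreadPoolExecutor works too.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for out in pool.map(decrypt_file, files):
            print("decrypted to", out)

With roughly 20 large files, this keeps the work on a single machine but still overlaps the decryption of several files at a time.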

