08-14-2024 01:21 PM
I have around 20 PGP files in a folder in my volume that I need to decrypt. I have a decryption function that accepts a file name and writes the decrypted file to a new folder in the same volume.
I thought I could create a Spark dataframe with the name of each file, wrap my decryption function in a UDF, and apply it to each row. I expected this to decrypt the files in parallel; the files are very large, so processing them one at a time is slow.
The issue is that even though the files are there and I can run the process on an individual file with no problems, when I apply the UDF to the dataframe of file names I get a 'PermissionError: [Errno 13] Permission denied: "Path_to_file.pgp"' error.
Is this not something I can do with Spark?
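Roughly what I'm trying, simplified; the paths and the body of `decrypt_file` are placeholders for my real setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def decrypt_file(path: str) -> str:
    """Stand-in for my real function: decrypts `path` and writes the
    output to a new folder in the same volume, returning the new path."""
    out_path = path.replace("/encrypted/", "/decrypted/").removesuffix(".pgp")
    # ... actual PGP decryption of `path` into `out_path` goes here ...
    return out_path

# One row per file to decrypt (placeholder paths).
files = ["/Volumes/catalog/schema/vol/encrypted/file_01.pgp"]  # ~20 of these
df = spark.createDataFrame([(f,) for f in files], ["path"])

decrypt_udf = udf(decrypt_file, StringType())

# This is the step that raises the PermissionError on the workers.
df.withColumn("out_path", decrypt_udf("path")).collect()
```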
Labels: Spark
Accepted Solutions
08-14-2024 04:07 PM
The error happens because the UDF runs on Spark's worker nodes, which can't access your local files the way the driver can. Instead of using Spark to decrypt, do it outside of Spark with Python's multiprocessing module or a simple batch script for parallel processing; a sketch is below. Another option is to move the files to shared storage such as HDFS or S3 so all nodes can access them. Spark works best for processing data, not handling files directly.
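For example, a minimal multiprocessing sketch; `decrypt_file` and the glob pattern are placeholders for your own function and folder:

```python
import glob
from multiprocessing import Pool

def decrypt_file(path: str) -> None:
    """Placeholder for your existing decryption function, which takes a
    file path and writes the decrypted output to another folder."""
    ...

if __name__ == "__main__":
    # Adjust the pattern to wherever your encrypted files live.
    files = glob.glob("/path/to/encrypted/*.pgp")
    # Decrypt up to 4 files at a time in separate processes;
    # tune the pool size to your CPU count and disk throughput.
    with Pool(processes=4) as pool:
        pool.map(decrypt_file, files)
```

Since PGP decryption is typically CPU-bound, separate processes sidestep the GIL; if your function mostly waits on disk or network I/O, a thread pool would work just as well.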