Archive file support in Jar Type application

Abhay_1002
New Contributor

In my Spark application, I am using a set of Python libraries. I am submitting the Spark application as a Jar task, but I am not able to find any option to provide archive files.

So, in order to handle the Python dependencies, I am using the following approach:

  • Create an archive file of a Python virtual environment with the required set of libraries (<environment_name.tar.gz>)
  • Keep it on DBFS
  • In the code, add this archive file using sparkSession.sparkContext.addArchive(<dbfs:/environment_name.tar.gz>)

As the Spark documentation suggests, this archive should then be accessible and usable via the relative path ./<environment_name>/bin/python.
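
For reference, a minimal Scala sketch of this approach as it looks from the JAR's entry point (the class name and archive path are illustrative):

import org.apache.spark.sql.SparkSession

object ArchiveJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()

    // Distribute the packed virtual environment to every node.
    spark.sparkContext.addArchive("dbfs:/environment_name.tar.gz")

    // Per the Spark documentation, the unpacked archive should then be
    // reachable relative to the process working directory, e.g.:
    val pythonExec = "./environment_name/bin/python"
    // ... pythonExec would then be used to launch Python processes ...
  }
}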

But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python

I checked, and this archive file is present in different directories on the driver and executor nodes. There is no classpath directory as such, and because of this I cannot use a relative path.

The following are the directories where the archive file is present on the different nodes:
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/

Please suggest how I can use archive files in a Spark application submitted as a Jar type.

Version: Databricks Runtime 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

1 REPLY

Kaniz
Community Manager

Hi @Abhay_1002

  1. Using the --py-files Argument: When submitting a Spark application, you can use the --py-files argument to add .py, .zip, or .egg files to the Python path of the job. However, this approach is typically used for Python files directly, not for virtual environments.

  2. Alternative Approach: Since you’re specifically dealing with a Python virtual environment archive, the addArchive method you mentioned is the right way to distribute it. However, the issue you’re facing might be related to the relative path when running on the cluster.

  3. Classpath and Relative Paths: You mentioned that there is no classpath directory, which is correct. addArchive does not place the archive on the classpath; it downloads and unpacks it on every node. Relative paths to the unpacked archive are therefore resolved against the working directory of the Spark driver or executor process.

  4. Working Directory: The working directory for the driver and executor nodes might be different. To ensure that relative paths work consistently, consider the following steps:

    • Use Absolute Paths: Instead of relying on relative paths, use absolute paths within your Python code to access the necessary files within the virtual environment (see the sketch after this list for one way to discover those paths).
    • Set Working Directory: Explicitly set the working directory for your Spark application using os.chdir() in your Python code, so that relative paths are resolved consistently wherever that code runs.
  5. Debugging Steps: To debug further, you can:

    • Print the current working directory (os.getcwd()) in your Spark application to verify where it’s running.
    • Check if the archive file is accessible from the working directory.
    • Ensure that the archive file is correctly added using sparkContext.addArchive().
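
To illustrate points 3-5, here is a rough Scala sketch for a Jar task (names are illustrative, and the exact unpack layout is an assumption that may differ on Databricks). It prints the working directory on the driver and on the executors, and looks for the unpacked archive under the SparkFiles root instead of a bare relative path:

import java.io.File
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object ArchivePathDebug {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    spark.sparkContext.addArchive("dbfs:/environment_name.tar.gz")

    // Driver side: where is the process running, and where does Spark keep
    // files/archives registered via addFile/addArchive?
    val driverCwd = System.getProperty("user.dir")
    val driverFilesRoot = SparkFiles.getRootDirectory()
    println(s"driver cwd=$driverCwd sparkFilesRoot=$driverFilesRoot")

    // Executor side: a small job that reports the same information from each
    // task, plus whether the expected interpreter path exists there.
    val report = spark.sparkContext.parallelize(0 until 4, 4).map { _ =>
      val cwd = System.getProperty("user.dir")
      val filesRoot = SparkFiles.getRootDirectory()
      // Assumed layout: a directory named after the archive under the
      // SparkFiles root; compare with what you actually see on the nodes.
      val python = new File(filesRoot, "environment_name.tar.gz/bin/python")
      s"executor cwd=$cwd sparkFilesRoot=$filesRoot pythonExists=${python.exists()}"
    }.collect()

    report.foreach(println)
  }
}

Once the absolute location is known on each node, you can point at it directly instead of the relative ./environment/bin/python path.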

Remember that the working directory can vary between the driver and executor nodes, so using absolute paths or explicitly setting the working directory can help resolve relative path issues.

I hope this helps! If you have any further questions or need additional assistance, feel free to ask. 😊