In my Spark application I use a set of Python libraries. I am submitting the application to Databricks as a JAR task, but I cannot find any option there to provide archive files.
So, to handle the Python dependencies, I am using the following approach:
- Create an archive of a Python virtual environment containing the required libraries (`<environment_name>.tar.gz`)
- Upload it to DBFS
- In code, add the archive with `sparkSession.sparkContext.addArchive("dbfs:/<environment_name>.tar.gz")` (see the sketch after this list)
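For context, a minimal sketch of the relevant part of my JAR task (the object and archive names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object PythonEnvJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jar-task-with-python-venv")
      .getOrCreate()

    // Distribute the venv archive; Spark downloads and unpacks it on every node
    spark.sparkContext.addArchive("dbfs:/environment_name.tar.gz")

    // ... rest of the job, which expects ./environment_name/bin/python to exist
  }
}
```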
According to the Spark documentation, the unpacked archive should then be accessible via the relative path `./<environment_name>/bin/python`.
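For example, this is roughly how I try to use that interpreter from the driver (a sketch using `scala.sys.process`; the real job does more than print the version):

```scala
import scala.sys.process._

// Relative path as suggested in the Spark docs for unpacked archives
val pythonBin = "./environment_name/bin/python"

// Run the interpreter from the distributed virtual environment
val exitCode = Seq(pythonBin, "--version").!
println(s"python exited with $exitCode")
```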
But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python
I checked: the archive is extracted into different directories on the driver and executor nodes. There is no common working directory as such, so I cannot rely on the relative path.
These are the directories where the archive ends up on each node:
- driver: /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
- executor: /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/
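This is roughly how I checked the locations (a sketch using `SparkFiles`, assuming `spark` is the active SparkSession):

```scala
import org.apache.spark.SparkFiles

// Root directory for files/archives added via addFile/addArchive on the driver
println(s"driver root: ${SparkFiles.getRootDirectory()}")

// Same lookup executed inside tasks, to see where executors unpack the archive
val executorRoots = spark.sparkContext
  .parallelize(1 to 100)
  .map(_ => SparkFiles.getRootDirectory())
  .distinct()
  .collect()
executorRoots.foreach(root => println(s"executor root: $root"))
```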
Please suggest how I can use archive files in a Spark application submitted as a JAR task.
Version: Databricks Runtime 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)