How to Configure PySpark Jobs Using PEX

r-g-s-j
New Contributor

Issue

I am attempting to create a PySpark job via the Databricks UI (with spark-submit) using the parameters below (the dependencies are in the PEX file), but I am getting an exception that the PEX file does not exist. My understanding is that the --files option places the file in the working directory of the driver and of every executor, so I am confused as to why I am encountering this issue.

Config

[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]
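
For context, the PEX file referenced above is built with the pex CLI and behaves like a self-contained Python interpreter, so it can be sanity-checked locally before submitting. The package list here is only a placeholder, not my actual dependencies:

# build a PEX bundling the job's dependencies (placeholder packages)
pex pandas pyarrow -o my_pex.pex

# a PEX is executable and accepts interpreter flags such as -c
./my_pex.pex -c "import pandas; print(pandas.__version__)"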

Standard Error

OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more

What I have tried

Given that the PEX file doesn't seem to be visible, I have tried making it available in the following ways:

  • Adding the PEX via the --files option in spark-submit
  • Adding the PEX via the spark.files config when starting up the actual cluster
  • Playing around with the configs (e.g. using spark.pyspark.driver.python instead of spark.pyspark.python); see the consolidated sketch below
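
To make those attempts concrete, below is roughly the consolidated parameter set I have been experimenting with in the Databricks UI, porting over the PEX_ROOT environment variable from the EMR step further down. This is an illustrative, untested sketch of the combinations described above, not a configuration that worked:

[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"--conf","spark.pyspark.driver.python=./my_pex.pex",
"--conf","spark.executorEnv.PEX_ROOT=./tmp",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]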

Note: given the instructions at the bottom of this page, I believe PEX should work on Databricks; I'm just not sure about the right configs: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
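
For reference, the spark-submit pattern that post describes is roughly the following (paraphrased; pyspark_pex_env.pex and app.py are the post's example names, and PYSPARK_PYTHON is the environment-variable equivalent of the spark.pyspark.python config used above):

export PYSPARK_DRIVER_PYTHON=python  # per the post, do not set this in cluster modes
export PYSPARK_PYTHON=./pyspark_pex_env.pex
spark-submit --files pyspark_pex_env.pex app.py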

Note also that the following spark-submit command (defined as an EMR step) works on AWS EMR:

'HadoopJarStep': {
    'Jar': 'command-runner.jar',
    'Args': [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--files", "s3://some_path/my_pex.pex",
        "--conf", "spark.pyspark.driver.python=./my_pex.pex",
        "--conf", "spark.executorEnv.PEX_ROOT=./tmp",
        "--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
        "s3://some_path/main.py",
        "--some_arg", "some-val"
    ]
}

Any help would be much appreciated, thanks.

1 REPLY

franck
New Contributor II

Hi,

I'm facing the same issue trying to execute a PySpark job with spark-submit.

I have explored the same solutions as you:

  • --files option
  • spark.pyspark.driver.python
  • spark.executorEnv.PEX_ROOT

Did you make any progress on resolving the problem?
