<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: How to Configure PySpark Jobs Using PEX in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-configure-pyspark-jobs-using-pex/m-p/34194#M24964</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm facing the same issue trying to execute a PySpark job with spark-submit.&lt;/P&gt;&lt;P&gt;I have explored the same solutions as you:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;the --files option&lt;/LI&gt;&lt;LI&gt;spark.pyspark.driver.python&lt;/LI&gt;&lt;LI&gt;spark.executorEnv.PEX_ROOT&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Have you made any progress resolving the problem?&lt;/P&gt;</description>
    <pubDate>Fri, 28 Oct 2022 17:27:36 GMT</pubDate>
    <dc:creator>franck</dc:creator>
    <dc:date>2022-10-28T17:27:36Z</dc:date>
    <item>
      <title>How to Configure PySpark Jobs Using PEX</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-configure-pyspark-jobs-using-pex/m-p/34193#M24963</link>
      <description>&lt;P&gt;&lt;B&gt;Issue&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I am attempting to create a PySpark job via the Databricks UI (with spark-submit) using the parameters below (the dependencies are in the PEX file), but I am getting an exception that the PEX file does not exist. It's my understanding that the --files option puts the file in the working directory of the driver &amp;amp; every executor, so I am confused as to why I am encountering this issue.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Config&lt;/I&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;[
"--files","s3://some_path/my_pex.pex",
"--conf","spark.pyspark.python=./my_pex.pex",
"s3://some_path/main.py",
"--some_arg","2022-08-01"
]&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;Standard Error&lt;/I&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Exception in thread "main" java.io.IOException: Cannot run program "./my_pex.pex": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.&amp;lt;init&amp;gt;(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;What I have tried&lt;/B&gt;&lt;/P&gt;&lt;P&gt;Given that the PEX file doesn't seem to be visible, I have tried adding it in the following ways:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Adding the PEX file via the --files option in spark-submit&lt;/LI&gt;&lt;LI&gt;Adding the PEX file via the spark.files config when starting up the actual cluster&lt;/LI&gt;&lt;LI&gt;Playing around with the configs (e.g. using spark.pyspark.driver.python instead of spark.pyspark.python)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Note: given the instructions at the bottom of this page, I believe PEX should work on Databricks; I'm just not sure about the right configs: &lt;A href="https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html" alt="https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html" target="_blank"&gt;https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Note also that the following spark-submit command works on AWS EMR:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                "spark-submit",
                "--deploy-mode", "cluster", 
                "--master", "yarn",
                "--files", "s3://some_path/my_pex.pex", 
                "--conf", "spark.pyspark.driver.python=./my_pex.pex",
                "--conf", "spark.executorEnv.PEX_ROOT=./tmp",
                "--conf", "spark.yarn.appMasterEnv.PEX_ROOT=./tmp",
                "s3://some_path/main.py",
                "--some_arg", "some-val"
            ],&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any help would be much appreciated, thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Aug 2022 20:37:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-configure-pyspark-jobs-using-pex/m-p/34193#M24963</guid>
      <dc:creator>r-g-s-j</dc:creator>
      <dc:date>2022-08-19T20:37:22Z</dc:date>
    </item>
    <item>
      <title>Re: How to Configure PySpark Jobs Using PEX</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-configure-pyspark-jobs-using-pex/m-p/34194#M24964</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm facing the same issue trying to execute a PySpark job with spark-submit.&lt;/P&gt;&lt;P&gt;I have explored the same solutions as you:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;the --files option&lt;/LI&gt;&lt;LI&gt;spark.pyspark.driver.python&lt;/LI&gt;&lt;LI&gt;spark.executorEnv.PEX_ROOT&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Have you made any progress resolving the problem?&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2022 17:27:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-configure-pyspark-jobs-using-pex/m-p/34194#M24964</guid>
      <dc:creator>franck</dc:creator>
      <dc:date>2022-10-28T17:27:36Z</dc:date>
    </item>
  </channel>
</rss>