<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How can I run a Pyspark python script in a scala environment in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/how-can-i-run-a-pyspark-python-script-in-a-scala-environment/m-p/30313#M718</link>
    <description>&lt;P&gt;The error is in the attachment:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2140i9BD021B5B599A797/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 30 Jan 2022 02:45:44 GMT</pubDate>
    <dc:creator>177331</dc:creator>
    <dc:date>2022-01-30T02:45:44Z</dc:date>
    <item>
      <title>How can I run a Pyspark python script in a scala environment</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/how-can-i-run-a-pyspark-python-script-in-a-scala-environment/m-p/30312#M717</link>
      <description>&lt;P&gt;I need to use both Python Spark code and Scala Spark code in my project. A lot of the project configuration lives on the Scala side, so I want to generate the data from Scala and pass the data path to my Python script. The script can then use the Python ecosystem to train models and produce a dataset as its result, which Scala reads and passes into our downstream system.&lt;/P&gt;&lt;P&gt;However, when I tested the code below, I ran into some issues. Am I doing anything wrong? Is there a better way to achieve my goal?&lt;/P&gt;&lt;P&gt;Cmd 2, which runs a print-hello script, works fine.&lt;/P&gt;&lt;P&gt;Cmd 4, which runs the PySpark Python script, produces this error:&lt;/P&gt;&lt;PRE&gt;Error: Could not find or load main class org.apache.spark.launcher.Main
/databricks/spark/bin/spark-class: line 101: CMD: bad array subscript
Traceback (most recent call last):
  File "/tmp/cli.py", line 23, in &amp;lt;module&amp;gt;
    cli.main(sys.argv[1:], standalone_mode=False)
  File "/databricks/python3/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/databricks/python3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/databricks/python3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/cli.py", line 19, in cli
    spark = SparkSession.builder.getOrCreate()
  File "/databricks/spark/python/pyspark/sql/session.py", line 229, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/databricks/spark/python/pyspark/context.py", line 392, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/context.py", line 145, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/databricks/spark/python/pyspark/context.py", line 339, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/databricks/spark/python/pyspark/java_gateway.py", line 108, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number&lt;/PRE&gt;&lt;P&gt;The Scala cell itself reports:&lt;/P&gt;&lt;PRE&gt;stdout: java.io.PrintStream@2202fa90
stderr: java.io.PrintStream@4133a68d
import sys.process._
callPythonCli: ()Unit&lt;/PRE&gt;</description>
      <pubDate>Sun, 30 Jan 2022 02:43:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/how-can-i-run-a-pyspark-python-script-in-a-scala-environment/m-p/30312#M717</guid>
      <dc:creator>177331</dc:creator>
      <dc:date>2022-01-30T02:43:27Z</dc:date>
    </item>
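A minimal sketch of the path-handoff pattern the question describes, in plain Python. All names here (run_handoff, the inline cli.py stand-in) are hypothetical illustrations, not Databricks APIs. The point of the sketch: the script launched from the Scala cell exchanges data through file paths and does not call SparkSession.builder.getOrCreate() itself. The traceback above dies inside launch_gateway, which suggests the child process is trying to boot its own Spark JVM gateway from inside the driver; a child that only reads and writes files avoids that entirely.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for /tmp/cli.py. It takes an input path and an output path,
# does its "model training" work in plain Python, and never creates a
# SparkSession -- that call is what fails when the script is launched
# from inside a Scala notebook cell.
CHILD_SCRIPT = """\
import json, sys
in_path, out_path = sys.argv[1], sys.argv[2]
rows = json.load(open(in_path))
result = [{"id": r["id"], "score": r["value"] * 2} for r in rows]
json.dump(result, open(out_path, "w"))
"""

def run_handoff(rows):
    """Write rows to a temp path, run the child script as an OS process
    (the role sys.process plays on the Scala side), read the result back."""
    with tempfile.TemporaryDirectory() as d:
        script = Path(d, "cli.py")
        in_path = Path(d, "in.json")
        out_path = Path(d, "out.json")
        script.write_text(CHILD_SCRIPT)
        in_path.write_text(json.dumps(rows))
        subprocess.run(
            [sys.executable, str(script), str(in_path), str(out_path)],
            check=True,
        )
        return json.loads(out_path.read_text())

print(run_handoff([{"id": 1, "value": 10}, {"id": 2, "value": 20}]))
# prints [{'id': 1, 'score': 20}, {'id': 2, 'score': 40}]
```

In the actual notebook the Scala side would write the input dataset with Spark, shell out via sys.process exactly as in the question, and read the result path back with Spark; only the child's responsibilities differ, since all Spark work stays on the Scala side.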
    <item>
      <title>Re: How can I run a Pyspark python script in a scala environment</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/how-can-i-run-a-pyspark-python-script-in-a-scala-environment/m-p/30313#M718</link>
      <description>&lt;P&gt;The error is in the attachment:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2140i9BD021B5B599A797/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 30 Jan 2022 02:45:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/how-can-i-run-a-pyspark-python-script-in-a-scala-environment/m-p/30313#M718</guid>
      <dc:creator>177331</dc:creator>
      <dc:date>2022-01-30T02:45:44Z</dc:date>
    </item>
  </channel>
</rss>

