Hello,
I'm having an issue with the following setup:
- A local machine in WSL 1
  - Python 3.8 and 3.10
  - OpenJDK 19.0.1 (version "build 19.0.1+10-21")
- A Compute Instance in Azure Machine Learning
  - Python 3.8
  - OpenJDK 8 (version "1.8.0_392")
- A Compute Cluster in Azure Machine Learning with a custom Dockerfile
  - Python 3.10
  - OpenJDK 19.0.1 (version "build 19.0.1+10-21")
I cannot access/launch PySpark on the Compute Cluster, while on the other two I can. Here is how I install OpenJDK on the Compute Cluster (Dockerfile) and on local WSL:
# Download OpenJDK 19.0.1, unpack it into /opt and remove the archive
RUN wget https://download.java.net/java/GA/jdk19.0.1/afdd2e245b014143b62ccb916125e3ce/10/GPL/openjdk-19.0.1_linux-x64_bin.tar.gz \
    && tar xvf openjdk-19.0.1_linux-x64_bin.tar.gz \
    && mv jdk-19.0.1 /opt/ \
    && rm openjdk-19.0.1_linux-x64_bin.tar.gz

# Make this JDK the default Java for the image
ENV JAVA_HOME /opt/jdk-19.0.1
ENV PATH="${PATH}:$JAVA_HOME/bin"
On both of them, `java --version` gives this output:
openjdk 19.0.1 2022-10-18
OpenJDK Runtime Environment (build 19.0.1+10-21)
OpenJDK 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)
I did not install OpenJDK 8 on the Compute Instance; it came preinstalled by Azure on the VM.
Both the Compute Instance and the Compute Cluster are in the same subnet in Azure, so they have no network issues reaching Databricks (all private endpoints are working).
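For reference, this is roughly the kind of reachability check I mean, runnable from either compute (a minimal Python sketch; the hostname is a placeholder, not my real workspace address):

```python
import socket

# Placeholder Databricks workspace hostname; replace with the real one.
DATABRICKS_HOST = "adb-xxxxxxxxxxxx.x.azuredatabricks.net"

try:
    # Resolving the name and opening TCP 443 confirms DNS and routing
    # through the private endpoint; raises OSError on failure.
    with socket.create_connection((DATABRICKS_HOST, 443), timeout=5):
        print(f"{DATABRICKS_HOST}:443 is reachable")
except OSError as exc:
    print(f"Cannot reach {DATABRICKS_HOST}: {exc}")
```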
Here is the error I get when launching a simple Spark command on the Compute Cluster:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$5(SparkSubmitArguments.scala:163)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:163)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:118)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1063)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1072)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: 019325cc430b495e91604bf9052029ac000000: 019325cc430b495e91604bf9052029ac000000: Name or service not known
at java.base/java.net.InetAddress.getLocalHost(InetAddress.java:1776)
at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:1211)
at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:1204)
at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:1204)
at org.apache.spark.util.Utils$.$anonfun$localCanonicalHostName$1(Utils.scala:1261)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.localCanonicalHostName(Utils.scala:1261)
at org.apache.spark.internal.config.package$.<init>(package.scala:1080)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
... 10 more
Caused by: java.net.UnknownHostException: 019325cc430b495e91604bf9052029ac000000: Name or service not known
at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Inet6AddressImpl.java:52)
at java.base/java.net.InetAddress$PlatformResolver.lookupByName(InetAddress.java:1059)
at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1668)
at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:1003)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1658)
at java.base/java.net.InetAddress.getLocalHost(InetAddress.java:1771)
... 18 more
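Reading the trace, the failure is inside `InetAddress.getLocalHost()`: the cluster node's own hostname (`019325cc...`) apparently has no entry in DNS or /etc/hosts inside the container. That is only my assumption so far; below is a minimal sketch of how I can check it from Python, and of the `SPARK_LOCAL_IP` / `SPARK_LOCAL_HOSTNAME` environment variables Spark supports as a workaround:

```python
import os
import socket

# Does the node's own hostname resolve inside the container?
hostname = socket.gethostname()
try:
    socket.gethostbyname(hostname)
    print(f"{hostname} resolves fine")
except socket.gaierror:
    # Possible workaround (untested on my side): tell Spark which address
    # to use before the JVM/SparkSession is created, so it never calls
    # InetAddress.getLocalHost(). Adding the hostname to /etc/hosts in the
    # Dockerfile would be the other option.
    os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"  # or the node's private IP
    print(f"{hostname} does not resolve; set SPARK_LOCAL_IP as a workaround")
```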
From the Compute Cluster, I can curl the Databricks API to generate a Personal Access Token.
I also wrote a class that automatically generates an OAuth2 token from Azure, then uses it to generate a Databricks PAT and set up `databricks-connect`:
import subprocess

# Answers to the interactive prompts of `databricks-connect configure`,
# in the order they are asked: host, token, cluster id, org id, port.
stdin_list = [
    "https://" + settings.databricks_address,
    DatabricksTokenManager(settings.databricks_address).pat,
    settings.databricks_cluster_id,
    settings.databricks_org_id,
    str(settings.databricks_port),
]
stdin_string = "\n".join(stdin_list)

# Pipe the answers into `databricks-connect configure` via echo.
with subprocess.Popen(
    ["echo", "-e", stdin_string], stdout=subprocess.PIPE
) as echo:
    subprocess.check_output(
        ("databricks-connect", "configure"), stdin=echo.stdout
    )
    echo.wait()
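For what it's worth, the same piping can be expressed without the intermediate `echo` process; a minimal equivalent sketch (the `stdin_string` is built exactly as above, placeholder values here):

```python
import subprocess

# Placeholder answers: host, PAT, cluster id, org id, port, one per line.
stdin_string = "\n".join(["<host>", "<pat>", "<cluster_id>", "<org_id>", "15001"])

# Feed the answers to the interactive prompts directly via stdin.
subprocess.run(
    ["databricks-connect", "configure"],
    input=stdin_string + "\n",
    text=True,
    check=True,
)
```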
settings.databricks_address holds a string in this format: "adb-xxxxxxxxxxxx.x.azuredatabricks.net/".
settings.databricks_cluster_id is taken from the URL of a Databricks cluster; the same goes for the organization ID and the port. The resulting configuration looks like this:
{
"token": "dapixxxxxxxxxxxxxxxxxxxxxxx-2",
"cluster_id": "0119-xxxxxx-xxxxxxx",
"org_id": "542xxxxxxxxxxx",
"port": "15001"
}
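With that configuration in place, even a minimal PySpark call already fails on the cluster with the trace above. Roughly this kind of snippet (the actual code is larger; this is just the shape of it):

```python
from pyspark.sql import SparkSession

# Building the session is what launches the local JVM via spark-submit,
# which is where the UnknownHostException above is raised.
spark = SparkSession.builder.getOrCreate()
print(spark.range(10).count())
```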
So I cannot understand why this works everywhere except the Compute Cluster, with the same Python code and the same OpenJDK/Python versions.