Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Getting Spark & Scala version in Cluster node initialization script

ahuarte
New Contributor III

Hi there,

I am developing a Cluster node initialization script (https://docs.gcp.databricks.com/clusters/init-scripts.html#environment-variables) in order to install some custom libraries.

According to the Databricks docs, an init script can read several environment variables with data about the current cluster node.
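For example (a minimal sketch; the variable names are taken from those docs):

echo "cluster id: $DB_CLUSTER_ID, is driver: $DB_IS_DRIVER"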

But I need to figure out which Spark & Scala versions are currently being deployed. Is this possible?

Thanks in advance

Regards


18 REPLIES

Kaniz
Community Manager

Hi @A Huarte! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.

ahuarte
New Contributor III

Hi Kaniz, thank you very much. I'm sure I will learn a lot in this forum.

Prabakar
Esteemed Contributor III

Hi @A Huarte, you can get the Spark and Scala versions from the DBR (Databricks Runtime) that you will be using on the cluster.

[Screenshot: the cluster's Databricks Runtime version selector]


ahuarte
New Contributor III

Hi @Prabakar Ammeappin, thank you very much for your response,

but I meant: how can I get this info in a script? I am trying to develop this init shell script for several clusters with different Databricks Runtimes.

I tried searching for files in that script, but I did not find any "*spark*.jar" file from which to extract the current runtime version (Spark & Scala).

Once the cluster has started, files matching this pattern do exist, but at the moment the init script is executed it seems that PySpark is not installed yet.

ahuarte
New Contributor III

I know the Databricks CLI tool is available, but it is not yet configured when the init script is running.

sean_owen
Honored Contributor II
(Accepted Solution)

Hm, this is a hacky idea (maybe there is a better way), but you could run:

ls /databricks/jars/spark*

and parse the results to get the version of Spark and Scala. You'll see files like spark--command--command-spark_3.1_2.12_deploy.jar containing the versions.
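A minimal sketch of that parsing in an init script, assuming the jar-name pattern above (names may vary across DBR versions):

#!/bin/bash
# Hypothetical snippet: extract "spark_<major.minor>_<scala>" from a jar name
# such as spark--command--command-spark_3.1_2.12_deploy.jar.
versions=$(ls /databricks/jars/spark* | grep -o 'spark_[0-9.]*_[0-9.]*' | head -n 1)
SPARK_VERSION=$(echo "$versions" | cut -d'_' -f2)   # e.g. 3.1 (no patch version)
SCALA_VERSION=$(echo "$versions" | cut -d'_' -f3)   # e.g. 2.12
echo "Spark $SPARK_VERSION, Scala $SCALA_VERSION"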

ahuarte
New Contributor III

Hi @Sean Owen, thanks for your reply,

your idea can work, but unfortunately there is no filename with the full version string. I am missing the minor part:

yyyyyy_spark_3.2_2.12_xxxxx.jar -> Spark version is really 3.2.0

I have configured the Databricks CLI to get metadata of the cluster, and I get this output:

{
  "cluster_id": "XXXXXXXXX",
  "spark_context_id": YYYYYYYYYYYY,
  "cluster_name": "Devel - Geospatial",
  "spark_version": "10.1.x-cpu-ml-scala2.12",   ## <------ !!!!
  ...
}
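(For reference, the output above presumably comes from something like this legacy Databricks CLI call, with a placeholder cluster ID:)

databricks clusters get --cluster-id XXXXXXXXX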

"spark_version" property does not contain info about the spark version but about the DBR :-(, any thoughts?

Thanks in advance

regards

Alvaro

sean_owen
Honored Contributor II

Do you need such specific Spark version info? Why? It should not matter for user applications.

ahuarte
New Contributor III

sean_owen
Honored Contributor II

I doubt it's sensitive to a minor release; why would it be?

But you also control which DBR/Spark version you launch the cluster with.

ahuarte
New Contributor III

Many thanks @Sean Owen, I am going to follow your advice: I am not going to write a generic init script that figures out everything, but a specific version of it for each cluster type; really, we only have 3 DBR types.

Thank you very much for your support

Regards

Anonymous
Not applicable

@A Huarte - How did it go?

ahuarte
New Contributor III

Hi,

My idea was to deploy GeoMesa or RasterFrames on Databricks in order to provide spatial capabilities to the platform. Finally, following some advice in the RasterFrames Gitter chat, I selected DBR 9.0, where I am installing pyrasterframes 0.10.0 via pip without getting any errors.
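In init-script form that presumably boils down to something like this (the /databricks/python/bin/pip path is the usual convention for Databricks init scripts, assumed here):

/databricks/python/bin/pip install pyrasterframes==0.10.0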

I hope this info can help.

Regards