
Install Maven package on job cluster

nikhilkumawat
New Contributor III

I have a single-user all-purpose cluster, and I have created a workflow that reads an Excel file from an Azure storage account. For reading the Excel file I am using the com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library, which I have already installed on the cluster. Attaching a screenshot for reference.

Now I want to run the workflow on a job cluster, but I don't know how to install this Maven library on a job cluster. Is this something I can do with an init script? Or are there other ways to install a Maven package on a job cluster?

Can you please help me with this? Any help would be appreciated.

 


Kaniz_Fatma
Community Manager

Hi @nikhilkumawat, you can install a Maven library on a job cluster in Databricks.

There are two ways of doing this:

1. **Using the Libraries API**: You can use the Databricks Libraries API to install a library on a cluster.

Here is an example of how to install a Maven library using the Libraries API:

```python
import requests

# Replace with your Databricks host and token
HOST = "https://<your-databricks-instance>"
TOKEN = "<your-access-token>"

# Replace with your cluster id
CLUSTER_ID = "<your-cluster-id>"

headers = {
    "Authorization": f"Bearer {TOKEN}",
}

data = {
    "cluster_id": CLUSTER_ID,
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
            }
        }
    ],
}

response = requests.post(f"{HOST}/api/2.0/libraries/install", headers=headers, json=data)

print(response.status_code)
print(response.json())
```

This will install the com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library on the specified cluster.
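To confirm the install succeeded before running the job, you can poll the Libraries cluster-status endpoint. A minimal sketch, reusing HOST, TOKEN, and CLUSTER_ID from the snippet above:

```python
import requests

# Check the status of every library attached to the cluster.
response = requests.get(
    f"{HOST}/api/2.0/libraries/cluster-status",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
)

for lib in response.json().get("library_statuses", []):
    # status is one of PENDING, RESOLVING, INSTALLING, INSTALLED,
    # FAILED, or UNINSTALL_ON_RESTART.
    print(lib["library"], "->", lib["status"])
```

A Maven library is ready to use once its status reaches INSTALLED.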

2. **Using an init script**: You can also use an init script to install a Maven library when a cluster is starting up.

Here is an example of an init script that installs a Maven library:

```bash
#!/bin/bash
# Resolve the package (and its transitive dependencies) into the
# local ivy cache when the cluster starts up.
/databricks/spark/bin/spark-shell --packages com.crealytics:spark-excel_2.13:3.4.1_0.19.0
```

You would then specify this script when creating your job cluster.
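For a job cluster, the init script is attached in the job's new_cluster spec. Here is a minimal sketch using the Jobs API 2.1, assuming the script above was uploaded as a workspace file at /Shared/install-spark-excel.sh (a hypothetical path) and reusing HOST and TOKEN from the first snippet; the node type, runtime version, and notebook path are placeholders:

```python
import requests

# Create a job whose ephemeral job cluster runs the init script at startup.
job_spec = {
    "name": "read-excel-job",
    "tasks": [
        {
            "task_key": "read_excel",
            "notebook_task": {"notebook_path": "/Users/<user>/read_excel"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
                "init_scripts": [
                    {"workspace": {"destination": "/Shared/install-spark-excel.sh"}}
                ],
            },
        }
    ],
}

response = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(response.json())
```

Note that the Jobs API also accepts a task-level "libraries" field (e.g. [{"maven": {"coordinates": "..."}}]), which installs the package on the job cluster without an init script.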

nikhilkumawat
New Contributor III

Hi @Kaniz_Fatma 

Thanks for replying. I tried the second approach of using an init script. Apart from the Maven package, I also installed a Python package via the script. I imported the Python package in the notebook, so I know it was installed. But when I tried to read the Excel file, I got the error below:

```
Traceback (most recent call last):
  File "<command-709773507629376>", line 4, in <module>
    df = (spark.read.format("com.crealytics.spark.excel") \
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
    return self._df(self._jreader.load(path))
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
	at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
	at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
```

Below is the code in the notebook that I am using:

```python
# read the Excel file from the storage account
try:
    sheet_name = "LPT_Control_" + str(as_of_date) + "!A1"
    df = (spark.read.format("com.crealytics.spark.excel")
                .option("header", "true")
                .option("treatEmptyValuesAsNulls", "false")
                .option("dataAddress", sheet_name)
                .option("inferSchema", "true")
                .load(<path-to-storageaccount-file>))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading excel: {e}")
```

How do I check whether the Maven package is installed or not? And is there any configuration I need to set once the package is installed?

Waiting to hear back from you soon.

nikhilkumawat
New Contributor III

Hi @Kaniz_Fatma 

Any update on the above-mentioned issue regarding the Maven package?

nikhilkumawat
New Contributor III

Hi @Kaniz_Fatma 

Can you elaborate on a few more things:

1. When spark-shell installs a Maven package, what is the default location where it downloads the jar file?

2. As far as I know, the default location for jars is "/databricks/jars/", from where Spark picks up all the packages. Does spark-shell install the jar at a different place? If yes, please suggest how I can make Spark use jars from that place.

 

Waiting to hear from you soon.
