
Install maven package on job cluster

nikhilkumawat
New Contributor III

I have a single user cluster and I have created a workflow that reads an Excel file from an Azure storage account. For reading the Excel file I am using the com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library on a single user all-purpose cluster. I have already installed this library on the cluster. Attaching a screenshot for reference.

Now what I want to do is run the workflow on a job cluster, but I don't know how to install this Maven library on a job cluster. Is there something I can do with an init script? Or are there multiple ways to install a Maven package on a job cluster?

So can you please help me with this? Any help would really be appreciated.

 

4 REPLIES

Kaniz
Community Manager

Hi @nikhilkumawat, you can install a Maven library on a job cluster in Databricks.

There are two ways of doing this:

1. **Using the Libraries API**: You can use the Databricks Libraries API to install a library on a cluster.

Here is an example of how to install a Maven library using the Libraries API:

```python
import requests

# Replace with your Databricks host and token
HOST = "https://<your-databricks-instance>"
TOKEN = "<your-access-token>"

# Replace with your cluster id
CLUSTER_ID = "<your-cluster-id>"

headers = {
    "Authorization": f"Bearer {TOKEN}",
}

data = {
    "cluster_id": CLUSTER_ID,
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
            }
        }
    ]
}

response = requests.post(f"{HOST}/api/2.0/libraries/install", headers=headers, json=data)

print(response.status_code)
print(response.json())
```

This will install the com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library on the specified cluster.
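If you want to confirm that the library actually reached the cluster, you can query the cluster-status endpoint of the same Libraries API. A minimal sketch, reusing the HOST, TOKEN and CLUSTER_ID placeholders from the example above:

```python
import requests

# Sketch: ask the Libraries API for the install status of all libraries on the cluster.
response = requests.get(
    f"{HOST}/api/2.0/libraries/cluster-status",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
)
response.raise_for_status()

# Each entry reports the library spec and a status such as PENDING, INSTALLING or INSTALLED.
for lib in response.json().get("library_statuses", []):
    print(lib["library"], lib["status"])
```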

2. **Using an init script**: You can also use an init script to install a Maven library when a cluster is starting up.

Here is an example of an init script that installs a Maven library:

```bash
#!/bin/bash
/databricks/spark/bin/spark-shell --packages com.crealytics:spark-excel_2.13:3.4.1_0.19.0
```

You would then specify this script when creating your job cluster.
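For reference, here is a sketch of what referencing that init script from a job cluster could look like through the Jobs API. The workspace path, notebook path, node type and runtime version below are placeholders, not values from your workspace:

```python
import requests

# Sketch: create a job whose job cluster runs the init script above.
# HOST and TOKEN are the same placeholders as before; all paths and sizes are examples.
job_spec = {
    "name": "read-excel-job",
    "tasks": [
        {
            "task_key": "read_excel",
            "notebook_task": {"notebook_path": "/Users/<you>/read_excel_notebook"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
                # Path to the init script uploaded to the workspace (or DBFS / a volume).
                "init_scripts": [
                    {"workspace": {"destination": "/Shared/install-spark-excel.sh"}}
                ],
            },
        }
    ],
}

response = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(response.status_code, response.json())
```

A job task can also declare the same "libraries" list (with the Maven coordinates) directly in the job definition instead of using an init script.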

nikhilkumawat
New Contributor III

Hi @Kaniz 

Thanks for replying. I tried the second approach of using an init script. Apart from the Maven package, I also installed a Python package using the script. I imported the Python package in the notebook and it was installed. But when I tried to read the Excel file it gave the below error:

Traceback (most recent call last):
  File "<command-709773507629376>", line 4, in <module>
    df = (spark.read.format("com.crealytics.spark.excel") \
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
    return self._df(self._jreader.load(path))
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
	at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
	at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)

Below is the code in the notebook that I am using:

```python
# Read the Excel file from the storage account
try:
    sheet_name = "LPT_Control_" + str(as_of_date) + "!A1"
    df = (spark.read.format("com.crealytics.spark.excel")
                .option("header", "true")
                .option("treatEmptyValuesAsNulls", "false")
                .option("dataAddress", sheet_name)
                .option("inferSchema", "true")
                .load(<path-to-storageaccount-file>))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading excel: {e}")
```

How do I check whether the Maven package is installed or not? And is there any configuration I need to set once the package is installed?

Waiting to hear back from you soon.

nikhilkumawat
New Contributor III

Hi @Kaniz 

Any update on the above-mentioned issue regarding the Maven package?

nikhilkumawat
New Contributor III

Hi @Kaniz 

Can you elaborate on a few more things:

1. When spark-shell installs a Maven package, what is the default location where it downloads the JAR file?

2. As far as I know, the default location for JARs is "/databricks/jars/", from where Spark picks up all the packages. So does spark-shell install the JAR in a different place? If yes, please suggest how I can get Spark to use JARs from that place.

 

Waiting to hear from you soon.
