Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Install maven package on job cluster

nikhilkumawat
New Contributor III

I have a single user cluster, and I have created a workflow which reads an Excel file from an Azure storage account. For reading the Excel file I am using the com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library on a single user all-purpose cluster. I have already installed this library on the cluster. Attaching a screenshot for reference.

Now what I want to do is run the workflow on a job cluster, but I don't know how to install this maven library on a job cluster. Is there something I can do with an init script? Or are there multiple ways to install a maven package on a job cluster?

So can you please help me with this? Any help would really be appreciated.
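One common approach (a sketch, assuming the Databricks Jobs API JSON format; the coordinate is the one from the question above) is to declare the Maven coordinate in the task's `libraries` field of the job definition, so Databricks resolves the package, including its transitive dependencies, when the job cluster starts:

```json
{
  "libraries": [
    {
      "maven": {
        "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
      }
    }
  ]
}
```

The same library spec can be attached through the workflow UI's task "Dependent libraries" section, which avoids hand-maintaining an init script for dependency resolution.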

 

3 REPLIES

Hi @Retired_mod 

Thanks for replying. I tried the second approach of using an init script. Apart from the maven package I also installed a Python package via the script. I imported the Python package in the notebook and it worked. But when I tried to read the Excel file it gave the below error:

Traceback (most recent call last):
  File "<command-709773507629376>", line 4, in <module>
    df = (spark.read.format("com.crealytics.spark.excel") \
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
    return self._df(self._jreader.load(path))
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
	at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
	at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)
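A `NoSuchMethodError` on `scala.collection.immutable.Seq` in a stack trace like the one above typically points to a Scala binary-version mismatch: the `_2.13` suffix in the artifact name means the library was built for Scala 2.13, while Databricks Runtime ships Scala 2.12. A minimal sketch of reading that suffix out of a coordinate (the helper function is illustrative, not part of any API):

```python
# A NoSuchMethodError on scala.collection.immutable.Seq usually means the
# library's Scala binary version does not match the cluster's. The suffix
# after the last underscore in the artifact name encodes that version.

def scala_binary_version(coordinate: str) -> str:
    """Extract the Scala binary-version suffix from a Maven coordinate."""
    group, artifact, version = coordinate.split(":")
    return artifact.rsplit("_", 1)[1]

coord = "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
print(scala_binary_version(coord))  # -> 2.13
# Databricks Runtime uses Scala 2.12, so the matching artifact would carry
# the spark-excel_2.12 suffix instead.
```

If that is the cause here, switching the coordinate to the `_2.12` build of the same library version would be the fix to try first.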

Below is the code in the notebook that I am using:

# Read the Excel file from the storage account
try:
    sheet_name = "LPT_Control_" + str(as_of_date) + "!A1"
    df = (spark.read.format("com.crealytics.spark.excel")
            .option("header", "true")
            .option("treatEmptyValuesAsNulls", "false")
            .option("dataAddress", sheet_name)
            .option("inferSchema", "true")
            .load(<path-to-storageaccount-file>))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading excel: {e}")

How do I check whether the maven package is installed or not? And is there any configuration I need to set once the package is installed?

Waiting to hear back from you soon.

nikhilkumawat
New Contributor III

Hi @Retired_mod 

Any update on the above-mentioned issue regarding the maven package?

nikhilkumawat
New Contributor III

Hi @Retired_mod 

Can you elaborate on a few more things:

1. When spark-shell installs a maven package, what is the default location where it downloads the jar file?

2. As far as I know the default location for jars is "/databricks/jars/", from where Spark picks up all the packages. So does spark-shell install the jar in a different place? If yes, please suggest how I can make Spark use jars from that place.
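On question 1: in stock Apache Spark, `--packages` (and the `spark.jars.packages` config) resolve through Apache Ivy, and the downloaded jars land in the user's Ivy cache rather than in /databricks/jars, unless `spark.jars.ivy` points elsewhere. A sketch of the default locations (this is stock Spark/Ivy behaviour, not something confirmed for every Databricks runtime):

```python
import os

# With no spark.jars.ivy override, Spark's package resolution uses
# Apache Ivy's default layout under the invoking user's home directory.
default_ivy_root = os.path.join(os.path.expanduser("~"), ".ivy2")
cache_dir = os.path.join(default_ivy_root, "cache")  # raw Ivy download cache
jars_dir = os.path.join(default_ivy_root, "jars")    # resolved jars are copied here

print(jars_dir)
```

Setting `spark.jars.ivy` to a shared path, or attaching the package as a cluster/job library so Databricks handles placement, are the usual ways to make Spark pick the jars up.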

 

Waiting to hear from you soon.
