Re: Install maven package on job cluster

nikhilkumawat · ‎09-06-2023

I have a single user cluster and I have created a workflow which will read excel file from Azure storage account. For reading excel file I am using com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library on single user all-purpose cluster.I have already installed this library on the cluster. Attaching a screenshot for reference.

Now what I want to do is to run the workflow on a job cluster. But I don't know how to install this maven library on job cluster. Is there something I can do with init script ? Or there might be multiple ways to install maven package on job cluster ?

So can you please help me on this ? Any help would really be appreciated.

nikhilkumawat · ‎09-06-2023

Hi @Retired_mod

Thanks for replying. I tried the second approach of using init script. Apart from maven package I also installed a python package using script. I imported python package in notebook and it was installed. But when I was trying to read excel file it gave below error:

Traceback (most recent call last):
  File "<command-709773507629376>", line 4, in <module>
    df = (spark.read.format("com.crealytics.spark.excel") \
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
    return self._df(self._jreader.load(path))
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
	at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
	at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)

Below is the code in notebook that I am using:

#read the Excelfile from storage account
try:
    sheet_name="LPT_Control_"+str(as_of_date)+"!A1"
    df = (spark.read.format("com.crealytics.spark.excel") \
                .option("header", "true") \
                .option("treatEmptyValuesAsNulls", "false") \
                .option("dataAddress", sheet_name) \
                .options(inferSchema='True') \
                .load(<path-to-storageaccount-file>))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading excel {e.print}")

How do I check whether the maven package is installed or not ? And is thereany configuration I need to set once the package is installed ?

Waiting to hear back from you soon.

nikhilkumawat · ‎09-19-2023

Hi @Retired_mod

Any update on the above mentioned issue regarding maven package ?

nikhilkumawat · ‎09-20-2023

Hi @Retired_mod

Can you ellaborate few more things:

1. When spark-shell installs any maven package, what is the default location where it downloads the jar file ?

2. As far as I know default location for jars is "/databricks/jars/" from where spark picks all the packages. So does this spark-shell installs the jar at different place ? If yes, Please suggest how can I use spark to use jars from this place ?

Waiting to hear from you soon.