<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Install maven package on job cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/43912#M27578</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for replying. I tried the second approach, using an init script. Apart from the Maven package, I also installed a Python package via the script. I imported the Python package in a notebook and it worked. But when I tried to read the Excel file, it gave the error below:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Traceback (most recent call last):
  File "&amp;lt;command-709773507629376&amp;gt;", line 4, in &amp;lt;module&amp;gt;
    df = (spark.read.format("com.crealytics.spark.excel") \
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
    return self._df(self._jreader.load(path))
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
	at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
	at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)&lt;/LI-CODE&gt;&lt;P&gt;Below is the code in the notebook that I am using:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Read the Excel file from the storage account
try:
    sheet_name = "LPT_Control_" + str(as_of_date) + "!A1"
    df = (spark.read.format("com.crealytics.spark.excel")
                .option("header", "true")
                .option("treatEmptyValuesAsNulls", "false")
                .option("dataAddress", sheet_name)
                .option("inferSchema", "true")
                .load(&amp;lt;path-to-storageaccount-file&amp;gt;))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading Excel file: {e}")&lt;/LI-CODE&gt;&lt;P&gt;How do I check whether the Maven package is installed or not? And is there any configuration I need to set once the package is installed?&lt;/P&gt;&lt;P&gt;Waiting to hear back from you soon.&lt;/P&gt;</description>
    <pubDate>Thu, 07 Sep 2023 06:31:57 GMT</pubDate>
    <dc:creator>nikhilkumawat</dc:creator>
    <dc:date>2023-09-07T06:31:57Z</dc:date>
    <item>
      <title>Install maven package on job cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/43770#M27540</link>
      <description>&lt;P&gt;I have a&amp;nbsp;&lt;SPAN&gt;single-user cluster, and I have created a workflow that reads an Excel file from an Azure storage account.&amp;nbsp;For reading the Excel file I am using the&amp;nbsp;&lt;STRONG&gt;com.crealytics:spark-excel_2.13:3.4.1_0.19.0&lt;/STRONG&gt;&amp;nbsp;library on a single-user all-purpose cluster. I have already installed this library on the cluster. Attaching a screenshot for reference.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Now I want to run the workflow on a job cluster, but I don't know how to install this Maven library on a job cluster. Is there something I can do with an init script? Or are there multiple ways to install a Maven package on a job cluster?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;So can you please help me with this? Any help would really be appreciated.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 06 Sep 2023 09:56:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/43770#M27540</guid>
      <dc:creator>nikhilkumawat</dc:creator>
      <dc:date>2023-09-06T09:56:53Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package on job cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/43912#M27578</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for replying. I tried the second approach, using an init script. Apart from the Maven package, I also installed a Python package via the script. I imported the Python package in a notebook and it worked. But when I tried to read the Excel file, it gave the error below:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Traceback (most recent call last):
  File "&amp;lt;command-709773507629376&amp;gt;", line 4, in &amp;lt;module&amp;gt;
    df = (spark.read.format("com.crealytics.spark.excel") \
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
    return self._df(self._jreader.load(path))
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
	at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
	at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)&lt;/LI-CODE&gt;&lt;P&gt;Below is the code in the notebook that I am using:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Read the Excel file from the storage account
try:
    sheet_name = "LPT_Control_" + str(as_of_date) + "!A1"
    df = (spark.read.format("com.crealytics.spark.excel")
                .option("header", "true")
                .option("treatEmptyValuesAsNulls", "false")
                .option("dataAddress", sheet_name)
                .option("inferSchema", "true")
                .load(&amp;lt;path-to-storageaccount-file&amp;gt;))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading Excel file: {e}")&lt;/LI-CODE&gt;&lt;P&gt;How do I check whether the Maven package is installed or not? And is there any configuration I need to set once the package is installed?&lt;/P&gt;&lt;P&gt;Waiting to hear back from you soon.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Sep 2023 06:31:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/43912#M27578</guid>
      <dc:creator>nikhilkumawat</dc:creator>
      <dc:date>2023-09-07T06:31:57Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package on job cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/45397#M27865</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any update on the above-mentioned issue regarding the Maven package?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Sep 2023 04:34:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/45397#M27865</guid>
      <dc:creator>nikhilkumawat</dc:creator>
      <dc:date>2023-09-20T04:34:24Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package on job cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/45413#M27875</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you elaborate on a few more things:&lt;/P&gt;&lt;P&gt;1. When spark-shell installs a Maven package, what is the default location where it downloads the JAR file?&lt;/P&gt;&lt;P&gt;2. As far as I know, the default location for JARs is "/databricks/jars/", from where Spark picks up all the packages. So does spark-shell install the JAR in a different place? If yes, please suggest how I can get Spark to use JARs from that location.&lt;/P&gt;&lt;P&gt;Waiting to hear from you soon.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Sep 2023 08:53:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-on-job-cluster/m-p/45413#M27875</guid>
      <dc:creator>nikhilkumawat</dc:creator>
      <dc:date>2023-09-20T08:53:02Z</dc:date>
    </item>
  </channel>
</rss>