09-06-2023 02:56 AM
I have a single-user cluster, and I have created a workflow that reads an Excel file from an Azure storage account. To read the Excel file I am using the com.crealytics:spark-excel_2.13:3.4.1_0.19.0 library on a single-user all-purpose cluster, where I have already installed it. Attaching a screenshot for reference.
Now I want to run the workflow on a job cluster, but I don't know how to install this Maven library on a job cluster. Is there something I can do with an init script? Or are there other ways to install a Maven package on a job cluster?
So can you please help me with this? Any help would really be appreciated.
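For reference, this is roughly what I am imagining, if the Jobs API 2.1 lets you attach a Maven library to a job cluster the same way the cluster UI does. The host, token, notebook path, and cluster sizing below are placeholders, not values from my setup:

import requests

# Placeholders -- substitute your own workspace URL and access token.
host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "read-excel-job",
    "tasks": [
        {
            "task_key": "read_excel",
            "notebook_task": {"notebook_path": "/Workspace/path/to/notebook"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
            # Same Maven coordinates the cluster UI accepts; the coordinate's
            # Scala suffix needs to match the runtime's Scala version.
            "libraries": [
                {"maven": {"coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"}}
            ],
        }
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print(response.json())  # should contain the new job_id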
09-06-2023 11:31 PM
Hi @Retired_mod
Thanks for replying. I tried the second approach of using an init script. Besides the Maven package, I also installed a Python package through the script; I imported that Python package in the notebook, which confirmed the script ran. But when I tried to read the Excel file, I got the error below:
Traceback (most recent call last):
File "<command-709773507629376>", line 4, in <module>
df = (spark.read.format("com.crealytics.spark.excel") \
File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
res = func(*args, **kwargs)
File "/databricks/spark/python/pyspark/sql/readwriter.py", line 302, in load
return self._df(self._jreader.load(path))
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/errors/exceptions.py", line 228, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o602.load.
: java.lang.NoSuchMethodError: scala.collection.immutable.Seq.map(Lscala/Function1;)Ljava/lang/Object;
at com.crealytics.spark.excel.Utils$MapIncluding.unapply(Utils.scala:28)
at com.crealytics.spark.excel.WorkbookReader$.apply(WorkbookReader.scala:68)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:382)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
Below is the code in the notebook that I am using:

# Read the Excel file from the storage account
try:
    sheet_name = "LPT_Control_" + str(as_of_date) + "!A1"
    df = (spark.read.format("com.crealytics.spark.excel")
          .option("header", "true")
          .option("treatEmptyValuesAsNulls", "false")
          .option("dataAddress", sheet_name)
          .option("inferSchema", "true")
          .load(<path-to-storageaccount-file>))
except Exception as e:
    import traceback
    traceback.print_exc()
    print(f"Error occurred while reading excel: {e}")
How do I check whether the Maven package is installed or not? And is there any configuration I need to set once the package is installed?
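For what it's worth, here is how I tried to verify from the notebook that the jar is actually on the classpath. listJars() is a documented SparkContext method, but reaching it through the _jsc bridge from PySpark is my own guess at the cleanest route. (I also wonder whether the NoSuchMethodError could be a Scala version mismatch, since the library coordinate ends in _2.13 while the Databricks runtime may be built on Scala 2.12.)

# List the jars Spark has loaded and look for the spark-excel artifact.
# listJars() returns a Scala Seq[String], so it is iterated via py4j.
jars = spark.sparkContext._jsc.sc().listJars()
for i in range(jars.size()):
    name = jars.apply(i)
    if "spark-excel" in name or "crealytics" in name:
        print(name)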
Waiting to hear back from you soon.
09-19-2023 09:34 PM
Hi @Retired_mod
Any update on the above-mentioned issue regarding the Maven package?
09-20-2023 01:53 AM
Hi @Retired_mod
Can you elaborate on a few more things:
1. When spark-shell installs a Maven package, what is the default location where it downloads the jar file?
2. As far as I know, the default location for jars is "/databricks/jars/", from which Spark picks up all the packages. Does spark-shell install the jar somewhere else? If yes, please suggest how I can get Spark to pick up jars from that location (see the sketch below for what I checked).
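Here is the quick check I ran from a notebook cell to look at both candidate locations. The Ivy cache paths are my assumption about where spark-shell / --packages would cache downloads; the exact path may differ by runtime:

import os

# /databricks/jars holds the runtime's bundled jars; the .ivy2 paths are
# where I'd guess --packages downloads end up (an assumption on my part).
candidates = [
    "/databricks/jars",
    "/root/.ivy2/jars",
    os.path.expanduser("~/.ivy2/jars"),
]
for path in candidates:
    if os.path.isdir(path):
        hits = [f for f in os.listdir(path) if "excel" in f or "crealytics" in f]
        print(path, "->", hits or "no spark-excel jars found")
    else:
        print(path, "-> directory not present")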
Waiting to hear from you soon.