Successfully installed Maven:Coordinates:com.creal...

dataslicer · ‎02-04-2022

I am using Azure DBX 9.1 LTS and successfully installed the following library on the cluster using Maven coordinates:

com.crealytics:spark-excel_2.12:3.2.0_0.16.0

When I executed the following line:

excelSDF = spark.read.format("excel").option("dataAddress", "'Sheet1'!A1:C4").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").load(excel_sample)

the following exception was thrown:

Py4JJavaError: An error occurred while calling o438.load.
: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
	at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
	at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
	at shadeio.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:110)
	at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:72)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:388)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
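
For anyone reading the trace: the `byteArray(I)[B` part of the first line is a JVM method descriptor, not garbage. A minimal sketch for decoding it follows; the helper names `decode_method` and `_parse_type` are made up for illustration and are not part of any library. It only handles the `owner.method(args)return` shape that `NoSuchMethodError` prints.

```python
import re

# JVM primitive type codes used in method descriptors.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _parse_type(desc, pos):
    """Parse one type starting at desc[pos]; return (readable_name, next_pos)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type: L<internal/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:  # single-letter primitive code
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_method(signature):
    """Turn 'pkg.Class.method(args)ret' from a NoSuchMethodError into Java-like text."""
    match = re.fullmatch(r"(.+)\((.*)\)(.+)", signature)
    if match is None:
        raise ValueError(f"not a method descriptor: {signature!r}")
    owner, args, ret = match.groups()
    params, pos = [], 0
    while pos < len(args):
        name, pos = _parse_type(args, pos)
        params.append(name)
    return_type, _ = _parse_type(ret, 0)
    return f"{return_type} {owner}({', '.join(params)})"


print(decode_method("org.apache.commons.io.IOUtils.byteArray(I)[B"))
# byte[] org.apache.commons.io.IOUtils.byteArray(int)
```

Read this way, the error says the commons-io classes on the cluster's classpath have no `IOUtils.byteArray(int)` method returning `byte[]`, which suggests the shaded POI code in spark-excel was compiled against a newer commons-io than the one the runtime loads.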

I then tried to install the following dependency through the Azure Databricks Cluster Libraries web UI, using the Maven coordinates below, but the installation failed.

org.apache.commons:commons-io:2.11.0

Questions:

Is there a safeguard in Databricks that prevents the installation of this package?
How can users of the `spark-excel` library resolve this dependency on a Databricks cluster?

Thanks.

Update 01:

This seems to be a known open issue that others in the community are also facing.
- https://github.com/crealytics/spark-excel/issues/467
The temporary workaround from that thread is to revert to Data Source API v1.0.
The desired goal is to use Data Source API v2.0.

Update 02:

I made another attempt, as follows:
- Downloaded the binary distribution (commons-io-2.11.0-bin.tar.gz) directly from Apache Commons and extracted the jar
- Uploaded the extracted jar to the Azure Databricks Spark cluster as a JAR library
After the Spark cluster was restarted with the additional library (installed successfully), a new error appeared, reporting that the class org/apache/spark/sql/sources/v2/ReadSupport could not be found.

Py4JJavaError: An error occurred while calling o386.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport

- The missing class seems to be a class packaged in spark-sql jar.
- There seems to be some dependency weirdness with the DataSourceV2 classes.
- The dependency nightmare seems to be nested and never-ending.
- Hopefully the experts can weigh in on this.

Update 03:

I did a quick search on DataSourceV2, and the classes under org.apache.spark.sql.sources.v2 appear to exist only in the Spark 2.x branch. Databricks 9.1 LTS runs Spark 3.1.2. With this limited knowledge, I believe the spark-excel library is somehow referring to a stale or deprecated Spark 2.x API.
- Does anyone know how to determine which custom jar may still be calling this old DataSourceV2 API?
  - Once the offending jar is isolated, how can it be overridden so that the correct Spark API is used?
  - Again, I am not even 80% confident that this is the root cause. I am just sharing my current hypothesis to see if some progress can be made here.

Update 04:

I have tried several different versions of the library, and each one throws some kind of exception, with a different call stack:
- com.crealytics:spark-excel_2.12:3.1.2_0.16.0
  - java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
- com.crealytics:spark-excel_2.12:3.1.2_0.15.2
  - java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
- com.crealytics:spark-excel_2.12:0.14.0
  - Does not throw any exception when running this one-line command:

excelSDF = spark.read.format("excel").option("dataAddress", "'Sheet1'!A1:C4").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").load(excel_sample)

However, when I executed the following line of code in the next Cmd cell,

display(excelSDF)

I got a different exception:

NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V
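
As a quick sanity check on the versions tried above, the sketch below compares the Spark prefix of each spark-excel version string with the cluster's Spark version. It assumes the version scheme is `<spark-version>_<library-version>`, which is how the coordinates in this thread read; the function names are made up for illustration.

```python
def parse_spark_excel_version(version):
    """Split '<spark>_<library>' into its two parts; versions without '_' have no Spark prefix."""
    if "_" in version:
        spark_part, library_part = version.split("_", 1)
        return spark_part, library_part
    return None, version


def matches_cluster(version, cluster_spark_version):
    """True when the Spark prefix of the artifact equals the cluster's Spark version."""
    spark_part, _ = parse_spark_excel_version(version)
    return spark_part == cluster_spark_version


cluster_spark = "3.1.2"  # Databricks 9.1 LTS, as stated in this thread
for v in ["3.2.0_0.16.0", "3.1.2_0.16.0", "3.1.2_0.15.2", "0.14.0"]:
    print(v, matches_cluster(v, cluster_spark))
```

A matching prefix is not sufficient on its own: the two 3.1.2 builds listed above still failed with the commons-io error. The check only flags the obvious mismatches, such as a 3.2.0 build on a Spark 3.1.2 cluster or a build with no Spark prefix at all.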

dataslicer · ‎04-15-2022

Using the older library as suggested worked in DBR 10.4 LTS. Thank you.

On a separate note, my curiosity in understanding the changes in the underlying datasource v2 API is ongoing. 😀

Atanu · ‎03-15-2022

This is a library dependency conflict. You need to exclude the conflicting dependency to get it working. @Jim Huang
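
For reference, exclusions can be given when a Maven library is installed, either in the Exclusions field of the install dialog or through the Libraries API. The sketch below builds a request body for the API route, assuming the `POST /api/2.0/libraries/install` endpoint and its `maven.coordinates` / `maven.exclusions` fields. The cluster ID and the exclusion entry are placeholders, since this thread does not say which dependency to exclude; the function name is made up for illustration.

```python
import json


def maven_install_payload(cluster_id, coordinates, exclusions):
    """Build the JSON body for a Maven library install with dependency exclusions."""
    return {
        "cluster_id": cluster_id,
        "libraries": [
            {"maven": {"coordinates": coordinates, "exclusions": list(exclusions)}}
        ],
    }


payload = maven_install_payload(
    cluster_id="<cluster-id>",  # placeholder
    coordinates="com.crealytics:spark-excel_2.12:3.2.0_0.16.0",
    exclusions=["<groupId>:<artifactId>"],  # placeholder: not named in this thread
)
print(json.dumps(payload, indent=2))
```

Each exclusion is written as `groupId:artifactId`, without a version. Whether an exclusion actually resolves the `byteArray` error depends on which commons-io copy the runtime ends up loading, so it needs to be verified on the cluster.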

dataslicer · ‎04-14-2022

Thank you for providing another option to address this issue.

I have some follow-up questions:

Which dependency should be excluded in this situation?
How can such a dependency be excluded in the Databricks runtime environment?
Is there a reference you can provide regarding this approach?

Thanks!

RamRaju · ‎11-09-2023

Hi @dataslicer were you able to solve this issue?

I am using Databricks 9.1 LTS with Spark 3.1.2 and Scala 2.12. I have installed com.crealytics:spark-excel-2.12.17-3.1.2_2.12:3.1.2_0.18.1. It was working fine, but now I am facing the same exception as you. Could you please help?

Thank you.

Successfully installed Maven:Coordinates:com.crealytics:spark-excel_2.12:3.2.0_0.16.0 on Azure DBX 9.1 LTS runtime but getting error for missing dependency: org.apache.commons.io.IOUtils.byteArray(I)