Installed com.crealytics:spark-excel_2.12:3.2.0_0.16.0 from Maven coordinates on Azure DBX 9.1 LTS runtime, but reading fails with NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

dataslicer
Contributor

I am using Azure DBX 9.1 LTS and successfully installed the following library on the cluster using Maven coordinates:

com.crealytics:spark-excel_2.12:3.2.0_0.16.0

When I executed the following line:

excelSDF = spark.read.format("excel").option("dataAddress", "'Sheet1'!A1:C4").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").load(excel_sample)

I get the following exception thrown:

Py4JJavaError: An error occurred while calling o438.load.
: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
	at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
	at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
	at shadeio.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:110)
	at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:72)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:388)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
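
For readers unfamiliar with the `(I)[B` suffix in the error: it is a JVM method descriptor, and a `NoSuchMethodError` usually means the class that was actually loaded at runtime is a different (often older) version than the one the caller was compiled against. The sketch below is a small decoder written for this thread (the function names are my own, not part of any library) that turns a descriptor into a readable signature.

```python
# Primitive type codes used in JVM method descriptors.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (readable name, next position)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type: L<internal/class/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_descriptor(desc):
    """Turn a method descriptor such as '(I)[B' into (return type, [parameter types])."""
    if not desc.startswith("("):
        raise ValueError("not a method descriptor: " + desc)
    pos, params = 1, []
    while desc[pos] != ")":
        type_name, pos = _read_type(desc, pos)
        params.append(type_name)
    return_type, _ = _read_type(desc, pos + 1)
    return return_type, params


return_type, params = decode_descriptor("(I)[B")
print(f"{return_type} byteArray({', '.join(params)})")  # prints: byte[] byteArray(int)
```

So the JVM was looking for `byte[] byteArray(int)` on the `org.apache.commons.io.IOUtils` class it had loaded and did not find it.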

When I tried to install the following dependency through the Azure Databricks cluster Libraries web UI, using these Maven coordinates, the installation failed.

org.apache.commons:commons-io:2.11.0

Questions:

  1. Is there a safeguard in Databricks that prevents the installation of this package?
  2. How can users of the `spark-excel` library address this dependency on a Databricks cluster?

Thanks.

Update 01:

Update 02:

  • Another attempt was made as follows:
  • After the Spark cluster was restarted with the additional libraries (installed successfully), a new error popped up complaining that org/apache/spark/sql/sources/v2/ReadSupport is not in the commons-io 2.11 jar.
Py4JJavaError: An error occurred while calling o386.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
  • The missing class seems to be a class packaged in the spark-sql jar.
  • There seems to be some dependency weirdness with the DataSourceV2 classes.
  • The dependency nightmare seems to be nested and never-ending.
  • Hopefully the experts can weigh in on this.

Update 03:

  • I performed a quick search regarding DataSourceV2, and this appears to be an API that only exists in the Spark 2.x branch. Databricks 9.1 LTS is running Spark 3.1.2. With this limited knowledge, I believe the spark-excel library is somehow referring to a stale / deprecated Spark 2.x API.
    • Does anyone know how to determine which custom jar may still be calling this old DataSourceV2 API?
      • Once that offending jar is isolated, how can it be overridden so that the correct Spark API is used?
      • Again, I am not exactly 80% confident that this is the root cause. I am just sharing the current hypothesis to see if some progress can be made here.
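
One possible way to narrow down which jar mentions the old class is to scan the compiled class files for its internal name, since class names are stored as plain strings inside `.class` files. The following is only a rough sketch: `jars_referencing` is a helper made up for this thread, and the directory you point it at (for example `/databricks/jars` on the driver) is an assumption that you would need to verify on your own cluster.

```python
import zipfile
from pathlib import Path


def jars_referencing(class_name, jar_dir):
    """Map jar file name -> class entries whose bytes mention class_name.

    A class file also mentions its own name, so the jar that defines the
    class (if any) shows up in the result as well.
    """
    needle = class_name.replace(".", "/").encode("utf-8")
    hits = {}
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                found = [
                    entry
                    for entry in archive.namelist()
                    if entry.endswith(".class") and needle in archive.read(entry)
                ]
        except zipfile.BadZipFile:
            continue  # not a readable archive, skip it
        if found:
            hits[jar.name] = found
    return hits


# Hypothetical usage; the directory is an assumption, not a documented location.
# print(jars_referencing("org.apache.spark.sql.sources.v2.ReadSupport", "/databricks/jars"))
```

A plain byte search like this can produce false positives (for example, longer names that merely start with the same prefix), so treat the result as a list of candidates to inspect rather than a definitive answer.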

Update 04:

  • I have tried several different versions of the library, and they all throw some sort of exception, each with a different call stack:
    • com.crealytics:spark-excel_2.12:3.1.2_0.16.0
      • java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    • com.crealytics:spark-excel_2.12:3.1.2_0.15.2
      • java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    • com.crealytics:spark-excel_2.12:0.14.0
      • Does not throw any exception when running this one-line command:
excelSDF = spark.read.format("excel").option("dataAddress", "'Sheet1'!A1:C4").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").load(excel_sample)
  • However, when I executed the following line of code in the next Cmd cell,
display(excelSDF)
  • I get a different exception:
NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V
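
As a side note on the version strings tried above: my understanding, which is an assumption about the spark-excel naming convention rather than something confirmed in this thread, is that a version such as `3.2.0_0.16.0` encodes the Spark version the artifact was built for, followed by the library version. Under that assumption, a small helper (hypothetical, written for this post) can flag coordinates whose Spark prefix differs from the cluster's Spark version.

```python
def split_spark_excel_version(coordinates):
    """Split 'group:artifact:version' into (spark_version or None, library_version)."""
    _group, _artifact, version = coordinates.split(":")
    if "_" in version:
        spark_version, library_version = version.split("_", 1)
        return spark_version, library_version
    return None, version  # older releases carry no Spark prefix


def matches_cluster(coordinates, cluster_spark_version):
    """Return True/False for prefixed versions, None when there is no Spark prefix."""
    spark_version, _ = split_spark_excel_version(coordinates)
    if spark_version is None:
        return None
    return spark_version == cluster_spark_version


# Coordinates taken from this thread; DBR 9.1 LTS runs Spark 3.1.2.
for coords in [
    "com.crealytics:spark-excel_2.12:3.2.0_0.16.0",
    "com.crealytics:spark-excel_2.12:3.1.2_0.16.0",
    "com.crealytics:spark-excel_2.12:3.1.2_0.15.2",
    "com.crealytics:spark-excel_2.12:0.14.0",
]:
    print(coords, "->", matches_cluster(coords, "3.1.2"))
```

A matching prefix is no guarantee that the library works: the `3.1.2_*` builds above still failed with the commons-io error, so this check only rules out the obvious mismatch.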

Using the older library as suggested worked in DBR 10.4 LTS. Thank you.

On a separate note, my curiosity in understanding the changes in the underlying datasource v2 API is ongoing. 😀

Atanu
Databricks Employee

This is a library dependency conflict. You need to exclude the conflicting dependency to get it working. @Jim Huang

Thank you for providing another option to address this issue.

I have follow up questions:

  1. Which dependency should be excluded in this situation?
  2. How can such a dependency be excluded in the Databricks runtime environment?
    1. Is there a reference you can provide regarding this approach?
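
For context on the second question: as far as I can tell, the Maven install dialog in the cluster Libraries UI has an Exclusions field, and the Libraries API accepts an `exclusions` list of `groupId:artifactId` entries next to the coordinates. The sketch below only builds the request body for `POST /api/2.0/libraries/install`. The cluster id and the excluded artifact are placeholders, because which dependency should actually be excluded is exactly what has not been established here.

```python
import json


def maven_library_spec(coordinates, exclusions=()):
    """Build one Maven library entry, optionally with 'groupId:artifactId' exclusions."""
    spec = {"coordinates": coordinates}
    if exclusions:
        spec["exclusions"] = list(exclusions)
    return {"maven": spec}


def install_request_body(cluster_id, libraries):
    """Build the JSON body for a library install request on one cluster."""
    return {"cluster_id": cluster_id, "libraries": list(libraries)}


body = install_request_body(
    "0000-000000-example",  # hypothetical cluster id
    [
        maven_library_spec(
            "com.crealytics:spark-excel_2.12:3.1.2_0.16.0",
            exclusions=["some.group:some-artifact"],  # placeholder, not a recommendation
        )
    ],
)
print(json.dumps(body, indent=2))
```

Sending the body (with an authenticated HTTP client of your choice) is left out on purpose, since workspace URL and token handling differ per setup.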

Thanks!

RamRaju
New Contributor II

Hi @dataslicer, were you able to solve this issue?

I am using Databricks Runtime 9.1 LTS with Spark 3.1.2 and Scala 2.12, and I have installed com.crealytics:spark-excel-2.12.17-3.1.2_2.12:3.1.2_0.18.1. It was working fine, but I am now facing the same exception as you. Could you please help?

 

Thank you.