Data Engineering

Successfully installed Maven:Coordinates:com.crealytics:spark-excel_2.12:3.2.0_0.16.0 on Azure DBX 9.1 LTS runtime but getting error for missing dependency: org.apache.commons.io.IOUtils.byteArray(I)

dataslicer
Contributor

I am using Azure DBX 9.1 LTS and successfully installed the following library on the cluster using Maven coordinates:

com.crealytics:spark-excel_2.12:3.2.0_0.16.0

When I executed the following line:

excelSDF = spark.read.format("excel").option("dataAddress", "'Sheet1'!A1:C4").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").load(excel_sample)

I get the following exception thrown:

Py4JJavaError: An error occurred while calling o438.load.
: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
	at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
	at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
	at shadeio.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:110)
	at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
	at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
	at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
	at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:72)
	at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
	at scala.Option.orElse(Option.scala:447)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
	at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:388)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)
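
A note on reading the first line of the trace: `(I)[B` is a JVM method descriptor, meaning one `int` argument and a `byte[]` return value. The runtime is therefore looking for `byte[] IOUtils.byteArray(int)` and not finding it in whichever commons-io version was loaded. Below is a minimal sketch of a descriptor decoder; the names `describe` and `PRIMITIVES` are illustrative and not part of any library.

```python
# Decode a JVM method descriptor such as "(I)[B" into readable Java types.
# Illustrative helper only; not part of Spark, py4j, or spark-excel.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Return (java_type, next_pos) for the type that starts at desc[pos]."""
    dims = 0
    while desc[pos] == "[":          # each "[" adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":             # object type: L<class/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                            # single-letter primitive type
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def describe(descriptor):
    """Turn "(args)ret" into "ret (arg1, arg2, ...)"."""
    if not descriptor.startswith("("):
        raise ValueError("a method descriptor must start with '('")
    pos = 1
    args = []
    while descriptor[pos] != ")":
        arg, pos = _read_type(descriptor, pos)
        args.append(arg)
    ret, _ = _read_type(descriptor, pos + 1)
    return f"{ret} ({', '.join(args)})"


print(describe("(I)[B"))  # prints: byte[] (int)
```

The same helper can be applied to longer descriptors, such as constructor signatures ending in `)V`.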

When I tried to install the following dependency through the Azure Databricks cluster Libraries web UI, using the Maven coordinates below, the installation failed.

org.apache.commons:commons-io:2.11.0

Questions:

  1. Is there a safeguard in Databricks that prevents the installation of this package?
  2. How can users of the `spark-excel` library address this dependency on a Databricks cluster?

Thanks.

Update 01:

Update 02:

  • Another attempt tried as follows:
  • After the Spark cluster was restarted with the additional libraries (installed successfully), a new error appeared, complaining that org/apache/spark/sql/sources/v2/ReadSupport is not in the commons-io 2.11 jar.
Py4JJavaError: An error occurred while calling o386.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
  • The missing class seems to be packaged in the spark-sql jar.
  • There seems to be some dependency weirdness with the DataSourceV2 classes.
  • The dependency nightmare seems to be nested and never-ending.
  • Hopefully the experts can weigh in on this.

Update 03:

  • A quick search suggests that DataSourceV2 is an API that only exists in the Spark 2.x branch. Databricks 9.1 LTS runs Spark 3.1.2. With this limited knowledge, I believe the spark-excel library is somehow referring to a stale / deprecated Spark 2.x API.
    • Does anyone know how to determine which custom jar may still be calling this old DataSourceV2 API?
      • Once the offending jar is isolated, how can it be overridden so that the correct Spark API is used?
      • Again, I am not even 80% confident this is the root cause. I am just sharing the current hypothesis to see if some progress can be made here.
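
One possible way to narrow down which jar mentions the old class, sketched under assumptions: compiled `.class` files store referenced class names in slash-separated form, so a plain byte search over each jar's entries can flag candidates. The function name `jars_referencing` and the directory path in the usage note are hypothetical, and the jars would first need to be copied somewhere this script can read them.

```python
import zipfile
from pathlib import Path


def jars_referencing(jar_dir, class_name):
    """Return the names of jars in jar_dir whose .class files mention class_name.

    A match only means the name appears in a class file. That includes the
    jar that defines the class itself, so treat hits as candidates to inspect.
    """
    # Class files store names with "/" separators, e.g. org/apache/spark/...
    needle = class_name.replace(".", "/").encode("ascii")
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            for entry in zf.namelist():
                if entry.endswith(".class") and needle in zf.read(entry):
                    hits.append(jar.name)
                    break  # one matching class is enough to flag this jar
    return hits
```

Usage would look like `jars_referencing("/path/to/jars", "org.apache.spark.sql.sources.v2.ReadSupport")`, where the directory is a placeholder. This does not prove that the flagged jar is the one failing at runtime; it only shortens the list of suspects.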

Update 04:

  • I have tried several different versions of the library, and they all throw exceptions, though with different call stacks:
    • com.crealytics:spark-excel_2.12:3.1.2_0.16.0
      • java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    • com.crealytics:spark-excel_2.12:3.1.2_0.15.2
      • java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
    • com.crealytics:spark-excel_2.12:0.14.0
      • Does not throw any exception when running this one-line command:
excelSDF = spark.read.format("excel").option("dataAddress", "'Sheet1'!A1:C4").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("inferSchema", "true").load(excel_sample)
  • However, when I executed the following line of code in the next Cmd cell,
display(excelSDF)
  • I get a different exception:
NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V
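
A small sketch for comparing the versions listed above against the cluster, assuming the spark-excel version string follows the pattern `<Spark version>_<spark-excel version>` (that is how I read the coordinates; older tags such as 0.14.0 carry no Spark part). The helper names and the `CLUSTER_SPARK` constant are illustrative; 3.1.2 is the Spark version quoted in this thread for Databricks 9.1 LTS.

```python
def split_spark_excel_version(version):
    """Split "3.2.0_0.16.0" into ("3.2.0", "0.16.0").

    Assumes the "<spark>_<library>" naming pattern; tags without an
    underscore are returned as (None, version).
    """
    if "_" in version:
        spark_part, lib_part = version.split("_", 1)
        return spark_part, lib_part
    return None, version


def targets_cluster(version, cluster_spark):
    """True/False if the Spark part matches the cluster, None if unknown."""
    spark_part, _ = split_spark_excel_version(version)
    if spark_part is None:
        return None
    return spark_part == cluster_spark


CLUSTER_SPARK = "3.1.2"  # Spark version reported for DBR 9.1 LTS in this thread

for v in ["3.2.0_0.16.0", "3.1.2_0.16.0", "3.1.2_0.15.2", "0.14.0"]:
    print(v, "->", targets_cluster(v, CLUSTER_SPARK))
```

A matching Spark part is not sufficient on its own: as reported above, 3.1.2_0.16.0 matches the cluster and still hits the commons-io error, so the check only rules out the obviously mismatched builds.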

4 REPLIES

Using the older library as suggested worked in DBR 10.4 LTS. Thank you.

On a separate note, my curiosity in understanding the changes in the underlying DataSource V2 API is ongoing. 😀

Atanu
Databricks Employee

This is the library dependency. You need to exclude the dependency to get it working. @Jim Huang

Thank you for providing another option to address this issue.

I have follow up questions:

  1. Which dependency should be excluded in this situation?
  2. How can such a dependency be excluded in the Databricks runtime environment?
    1. Is there a reference you can provide regarding this approach?

Thanks!
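
For readers with the same question, here is a rough sketch of how an exclusion might be expressed when installing a Maven library through the Databricks Libraries REST API (`POST /api/2.0/libraries/install`). The field names `maven`, `coordinates`, and `exclusions` reflect my understanding of that API and should be checked against the official documentation; the Maven install dialog in the cluster UI also offers an Exclusions field. The cluster ID is a made-up placeholder, and the excluded artifact is only an illustration, since this thread does not establish which dependency should actually be excluded.

```python
import json


def maven_library_spec(coordinates, exclusions=()):
    """Build one Maven library entry, optionally with excluded artifacts."""
    spec = {"coordinates": coordinates}
    if exclusions:
        spec["exclusions"] = list(exclusions)  # "groupId:artifactId" strings
    return {"maven": spec}


def install_request(cluster_id, libraries):
    """Build the JSON body for a library install request on one cluster."""
    return {"cluster_id": cluster_id, "libraries": list(libraries)}


body = install_request(
    "0000-000000-example",  # hypothetical cluster ID
    [
        maven_library_spec(
            "com.crealytics:spark-excel_2.12:3.1.2_0.16.0",
            exclusions=["commons-io:commons-io"],  # placeholder, not a confirmed fix
        )
    ],
)
print(json.dumps(body, indent=2))
```

The sketch only builds the request body; sending it would additionally need the workspace URL and an access token, which are left out here.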

RamRaju
New Contributor II

Hi @dataslicer, were you able to solve this issue?

I am using Databricks 9.1 LTS with Spark 3.1.2 and Scala 2.12. I have installed com.crealytics:spark-excel-2.12.17-3.1.2_2.12:3.1.2_0.18.1. It was working fine, but now I am facing the same exception as you. Could you please help?

 

Thank you.
