Successfully installed Maven coordinates com.crealytics:spark-excel_2.12:3.2.0_0.16.0 on the Azure DBX 9.1 LTS runtime, but getting an error for missing dependency org.apache.commons.io.IOUtils.byteArray(I)
02-04-2022 01:23 PM
I am using Azure DBX 9.1 LTS and successfully installed the following library on the cluster using Maven coordinates:
com.crealytics:spark-excel_2.12:3.2.0_0.16.0
When I executed the following line:
excelSDF = (spark.read.format("excel")
    .option("dataAddress", "'Sheet1'!A1:C4")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")
    .load(excel_sample))
I get the following exception thrown:
Py4JJavaError: An error occurred while calling o438.load.
: java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
at org.apache.commons.io.output.AbstractByteArrayOutputStream.needNewBuffer(AbstractByteArrayOutputStream.java:104)
at org.apache.commons.io.output.UnsynchronizedByteArrayOutputStream.<init>(UnsynchronizedByteArrayOutputStream.java:51)
at shadeio.poi.util.IOUtils.peekFirstNBytes(IOUtils.java:110)
at shadeio.poi.poifs.filesystem.FileMagic.valueOf(FileMagic.java:209)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:206)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:172)
at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:107)
at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:122)
at com.crealytics.spark.v2.excel.ExcelTable.infer(ExcelTable.scala:72)
at com.crealytics.spark.v2.excel.ExcelTable.inferSchema(ExcelTable.scala:43)
at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
at com.crealytics.spark.v2.excel.ExcelDataSource.inferSchema(ExcelDataSource.scala:85)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:388)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
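A NoSuchMethodError like the one above usually means an older commons-io on the runtime classpath is shadowing the newer release that the library's shaded POI expects. As a first diagnostic step, one could check which commons-io jar the Databricks runtime actually ships. A minimal sketch, assuming the usual /databricks/jars location (the helper name is mine):

```python
import glob
import os

def find_jars(pattern, jar_dir="/databricks/jars"):
    """Return sorted jar filenames in jar_dir whose names contain pattern."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(jar_dir, "*%s*.jar" % pattern)))

# On a Databricks driver this lists the runtime's bundled commons-io jar(s);
# anywhere else (or if the path differs) it simply prints an empty list.
print(find_jars("commons-io"))
```

If the listed version predates the one spark-excel needs, that mismatch would explain the error.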
When I tried to install the following dependency library via the Azure Databricks Cluster Libraries web UI using the following Maven coordinates, the installation failed.
org.apache.commons:commons-io:2.11.0
(Note: the canonical Maven coordinates for this artifact are commons-io:commons-io:2.11.0; commons-io is not published under the org.apache.commons group id, which may explain the failed resolution.)
Questions:
- Is there a safeguard in Databricks that prevents the installation of this package?
- How can users of the `spark-excel` library address this dependency on Databricks cluster?
Thanks.
Update 01:
- This seems to be a known open issue that others in the community are also facing.
- The temporary workaround from that thread is to revert to Data Source API v1.
- The desired goal is to utilize Data Source API v2.
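The v1 fallback mentioned above can be sketched as follows: the v1 path is addressed by the long source name com.crealytics.spark.excel instead of the v2 short name "excel". The options mirror the original snippet; the helper name read_excel_v1 is mine, and the actual read call is left commented out because it needs a live SparkSession and the excel_sample path from the post.

```python
# Options reused from the original v2 snippet.
v1_options = {
    "dataAddress": "'Sheet1'!A1:C4",
    "header": "true",
    "treatEmptyValuesAsNulls": "true",
    "inferSchema": "true",
}

def read_excel_v1(spark, path, options=v1_options):
    """Read an Excel file through the v1 spark-excel data source,
    bypassing the v2 code path that triggers the error above."""
    return (spark.read.format("com.crealytics.spark.excel")
                 .options(**options)
                 .load(path))

# Usage on a cluster with spark-excel installed:
# excelSDF = read_excel_v1(spark, excel_sample)
```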
Update 02:
- Another attempt went as follows:
- Downloaded the binary release (commons-io-2.11.0-bin.tar.gz) directly from Apache Commons and extracted the jar.
- Uploaded the extracted jar to the Azure Databricks Spark cluster libraries as a JAR.
- After the Spark cluster restarted with the additional library (installed successfully), a new error popped up complaining that org/apache/spark/sql/sources/v2/ReadSupport is missing — a class that is certainly not part of the commons-io 2.11 jar.
Py4JJavaError: An error occurred while calling o386.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
- The missing class appears to be packaged in a spark-sql jar.
- There seems to be some dependency weirdness with the DataSourceV2 classes.
- The dependency nightmare seems to be nested and never-ending.
- Hopefully the experts can weigh in on this.
Update 03:
- Performed a quick search regarding DataSourceV2: the org.apache.spark.sql.sources.v2 API only exists in the Spark 2.x branch. Databricks 9.1 LTS runs Spark 3.1.2. With this limited knowledge, I believe the spark-excel library (or one of its dependencies) is somehow referring to this stale, removed Spark 2.x API.
- Does anyone know how to determine which custom jar may still be calling this old DataSourceV2 API?
- Once that offending jar is isolated, how can it be replaced so that the correct Spark API is used?
- Again, I am not fully confident this is the root cause; just sharing the current hypothesis to see if some progress can be made here.
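Finding which jar still references the Spark 2.x API can be done without any Spark tooling: compiled class references are stored as plain UTF-8 strings in a class file's constant pool, so a raw byte search over each jar's .class entries is enough to flag candidates. A minimal sketch (the function name is mine):

```python
import zipfile

# The Spark 2.x package path that the missing ReadSupport class lives under.
V2_MARKER = b"org/apache/spark/sql/sources/v2"

def jars_referencing_v2(jar_paths, marker=V2_MARKER):
    """Return the jars containing a .class file that mentions the Spark 2.x
    DataSourceV2 package, i.e. the likely offenders to inspect further."""
    hits = []
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as zf:
            for name in zf.namelist():
                if name.endswith(".class") and marker in zf.read(name):
                    hits.append(jar)
                    break  # one hit is enough to flag this jar
    return hits

# Usage sketch (hypothetical path): scan every library jar on the cluster.
# import glob
# print(jars_referencing_v2(glob.glob("/databricks/jars/*.jar")))
```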
Update 04:
- I have tried several different versions of the library, and they all throw some sort of exception from different call stacks:
- com.crealytics:spark-excel_2.12:3.1.2_0.16.0
- java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
- com.crealytics:spark-excel_2.12:3.1.2_0.15.2
- java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B
- com.crealytics:spark-excel_2.12:0.14.0
- Does not throw any exception when completing this one-line command:
excelSDF = (spark.read.format("excel")
    .option("dataAddress", "'Sheet1'!A1:C4")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")
    .load(excel_sample))
- However, when I executed the following line of code in the next Cmd cell,
display(excelSDF)
- I get a different exception:
NoSuchMethodError: org.apache.spark.sql.catalyst.util.FailureSafeParser.<init>(Lscala/Function1;Lorg/apache/spark/sql/catalyst/util/ParseMode;Lorg/apache/spark/sql/types/StructType;Ljava/lang/String;)V
Labels:
- Azure DBX
- DataSourceV2
- excel xlsx xls
- Maven
04-15-2022 08:24 AM
Using the older library as suggested worked in DBR 10.4 LTS. Thank you.
On a separate note, my curiosity in understanding the changes in the underlying datasource v2 API is ongoing. 😀
03-15-2022 09:41 PM
This is caused by a conflicting library dependency. You need to exclude the conflicting dependency when installing to get it working. @Jim Huang
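For reference, a sketch of what an exclusion could look like as a Databricks Libraries API install payload; the cluster id is a placeholder, and which artifact to exclude is exactly the open question in the follow-up below, so commons-io:commons-io here is only an illustrative guess. The Cluster Libraries web UI exposes the same mechanism as an Exclusions field on the Maven install dialog.

```python
# Hypothetical payload for POST /api/2.0/libraries/install.
payload = {
    "cluster_id": "my-cluster-id",  # placeholder
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.12:3.2.0_0.16.0",
                # Keep the named transitive dependency off the classpath so a
                # version you control (or the runtime's own) wins instead.
                "exclusions": ["commons-io:commons-io"],  # illustrative only
            }
        }
    ],
}
```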
04-14-2022 03:48 PM
Thank you for providing another option to address this issue.
I have follow-up questions:
- Which dependency should be excluded in this situation?
- How does one exclude such a dependency in the Databricks runtime environment?
- Is there a reference you can provide regarding this approach?
Thanks!
11-09-2023 04:32 AM
Hi @dataslicer were you able to solve this issue?
I am using the Databricks 9.1 LTS runtime with Spark 3.1.2 and Scala 2.12. I have installed com.crealytics:spark-excel-2.12.17-3.1.2_2.12:3.1.2_0.18.1. It was working fine, but now I am facing the same exception as you. Could you please help?
Thank you.