
Error when reading Excel file: "org.apache.poi.ooxml.POIXMLException: Strict OOXML isn't currently supported, please see bug #57699"

Mado
Valued Contributor II

Hi,

I want to read an Excel "xlsx" file. The Excel file has several sheets and a multi-row header. The original file was in "xlsm" format, and I changed the extension to "xlsx".

I tried the following code:

filepath_xlsx = "dbfs:/FileStore/Sample_Excel/data.xlsx"

# Read excel file
sampleDF_xlsx = (
    spark.read.format("com.crealytics.spark.excel")
    .option("sheetName", "Sheet1")
    .option("Header", "true")
    .option("inferSchema", "false")
    .option("treatEmptyValuesAsNulls", "false")
    .load(filepath_xlsx)
)

display(sampleDF_xlsx)

And I get the error:

org.apache.poi.ooxml.POIXMLException: Strict OOXML isn't currently supported, please see bug #57699
 
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-496434324351845> in <cell line: 4>()
      2 
      3 # Read excel file
----> 4 sampleDF_xlsx = (spark.read.format("com.crealytics.spark.excel")
      5   .option("sheetName", "Database BT")
      6   .option("Header", "true")
 
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter()
     47             try:
---> 48                 res = func(*args, **kwargs)
     49                 logger.log_success(
     50                     module_name, class_name, function_name, time.perf_counter() - start, signature
 
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    175         self.options(**options)
    176         if isinstance(path, str):
--> 177             return self._df(self._jreader.load(path))
    178         elif path is not None:
    179             if type(path) != list:
 
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 
 
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    194     def deco(*a: Any, **kw: Any) -> Any:
    195         try:
--> 196             return f(*a, **kw)
    197         except Py4JJavaError as e:
    198             converted = convert_exception(e.java_exception)
 
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)
 
Py4JJavaError: An error occurred while calling o578.load.
: org.apache.poi.ooxml.POIXMLException: Strict OOXML isn't currently supported, please see bug #57699
	at org.apache.poi.ooxml.POIXMLDocumentPart.getPartFromOPCPackage(POIXMLDocumentPart.java:757)
	at org.apache.poi.ooxml.POIXMLDocumentPart.<init>(POIXMLDocumentPart.java:151)
	at org.apache.poi.ooxml.POIXMLDocumentPart.<init>(POIXMLDocumentPart.java:141)
	at org.apache.poi.ooxml.POIXMLDocument.<init>(POIXMLDocument.java:60)
	at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:254)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
	at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at org.apache.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:107)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:107)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:34)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:33)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:92)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:52)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:368)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:324)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:324)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:237)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:750)

I have installed "com.crealytics:spark-excel_2.12:3.3.1_0.18.5" on the cluster.

The same code works without any error on another dataset (a single sheet, a single-row header, and cleaned data).

Any idea how to solve this issue?
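
A possible first diagnostic step, sketched under assumptions: the error message suggests the workbook was saved in the "Strict Open XML Spreadsheet" variant, and renaming "xlsm" to "xlsx" does not change how the file is stored internally. As far as I understand, a Strict file declares its main document relationship in `_rels/.rels` with a `purl.oclc.org` type URI, while a regular (Transitional) file uses a `schemas.openxmlformats.org` URI. The helper names below (`detect_ooxml_flavor`, `workbook_flavor`) are hypothetical and are not part of spark-excel or Apache POI.

```python
import zipfile

# Relationship types that mark the main document part in _rels/.rels.
STRICT_REL = "http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument"
TRANSITIONAL_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def detect_ooxml_flavor(rels_xml: bytes) -> str:
    """Classify the contents of _rels/.rels by the relationship type it uses."""
    text = rels_xml.decode("utf-8", errors="replace")
    if STRICT_REL in text:
        return "strict"
    if TRANSITIONAL_REL in text:
        return "transitional"
    return "unknown"


def workbook_flavor(source) -> str:
    """Open an .xlsx/.xlsm file (local path or file-like object) as a ZIP
    archive and inspect its package-level relationships file."""
    with zipfile.ZipFile(source) as archive:
        return detect_ooxml_flavor(archive.read("_rels/.rels"))
```

On a cluster where DBFS is exposed as a local mount, something like `workbook_flavor("/dbfs/FileStore/Sample_Excel/data.xlsx")` might work; the `/dbfs/...` form of the path is an assumption about the workspace setup. If the result is "strict", re-saving the workbook in Excel as "Excel Workbook (*.xlsx)" rather than "Strict Open XML Spreadsheet (*.xlsx)" may be enough for the reader to open it.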

