How to read excel file using databricks

PraveenSaini — Tue, 07 May 2019 12:14:16 GMT

I have an Excel file as a source file, and I want to read its data and convert it to a DataFrame using Databricks. I have already added the Maven dependency for the Excel file format. When I try the code below, it gives an error (Error: java.io.FileNotFoundException: /FileStore/tables/Airline.xlsx (No such file or directory)), but the file is available. Please help me with this code.

val df = spark.read.format("com.crealytics.spark.excel")
          .option("location", "/FileStore/tables/Airline.xlsx")
          .option("useHeader", "true")
          .option("treatEmptyValuesAsNulls", "false")
          .option("inferSchema", "false")
          .option("addColorColumns", "false")
          .load("/FileStore/tables/Airline.xlsx")

Re: How to read excel file using databricks

ashish1 — Tue, 07 May 2019 14:46:39 GMT

Hi,

You can try -

val df = spark.read
          .format("org.zuinnote.spark.office.excel")
          .option("read.spark.useHeader", "true")  
          .load("dbfs:/FileStore/tables/Airline.xlsx")

Re: How to read excel file using databricks

MounicaVemulapa — Tue, 11 Jun 2019 08:36:57 GMT

@praveen Hi Praveen, did you find a workaround for this? I'm facing the same issue.

Re: How to read excel file using databricks

MounicaVemulapa — Tue, 11 Jun 2019 08:39:05 GMT

@ashish@databricks.com Hi Ashish, I'm getting the error java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) when I use your code.

I have installed spark_hadoopoffice_ds_2_12_1_3_1.jar for the above class. Please help.

Re: How to read excel file using databricks

Saphira — Thu, 13 Jun 2019 09:52:39 GMT

There should be nothing wrong with your code; the same code (except for the file name) works for me. Can you confirm that dbutils.fs.ls("dbfs:/FileStore/tables") prints at least your file's FileInfo, and that your cluster shows status 'installed' for the library with Maven coordinates "com.crealytics:spark-excel_2.11:0.11.1"?

Re: How to read excel file using databricks

darkfenixx1 — Thu, 27 Jun 2019 23:11:30 GMT

 I have the same problem, did you solve it?

Re: How to read excel file using databricks

vikrantm — Tue, 24 Sep 2019 09:42:20 GMT

I also tried the suggested library, but installing "com.crealytics:spark-excel_2.11:0.11.1" keeps failing (I tried the latest versions as well).

Re: How to read excel file using databricks

Saphira — Tue, 24 Sep 2019 09:50:04 GMT

Does it give this error while installing?

AttributeError: module 'lib' has no attribute 'SSL_ST_INIT'

Re: How to read excel file using databricks

vikrantm — Tue, 24 Sep 2019 10:14:16 GMT

Yes, it gives the error below while installing on the cluster:

Library resolution failed. Cause: java.lang.RuntimeException: org.tukaani:xz download failed.
    at com.databricks.libraries.server.MavenInstaller.$anonfun$resolveDependencyPaths$5(MavenLibraryResolver.scala:253)
    at scala.collection.MapLike.getOrElse(MapLike.scala:131)
    at scala.collection.MapLike.getOrElse$(MapLike.scala:129)
    at

Re: How to read excel file using databricks

ttration — Tue, 24 Sep 2019 13:00:09 GMT

For me, the problem was that the library was built for Scala 2.12 and my cluster was running Scala 2.11 (it should have been spark_hadoopoffice_ds_2_11_1_3_1).
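This kind of mismatch can be checked before installing. By Maven convention, Scala library artifacts end in `_<Scala binary version>`, and that suffix has to match the cluster's Scala version. Below is a minimal Python sketch of the check. The helper name `scala_suffix_matches` is made up for illustration; it only inspects the coordinate string and does not query the cluster.

```python
def scala_suffix_matches(coordinate: str, cluster_scala: str) -> bool:
    """Return True if a Maven coordinate targets the cluster's Scala binary version.

    coordinate    -- "group:artifact:version",
                     e.g. "com.crealytics:spark-excel_2.11:0.12.2"
    cluster_scala -- Scala version of the cluster, e.g. "2.11" or "2.11.12"
    """
    parts = coordinate.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected group:artifact:version, got {coordinate!r}")
    artifact = parts[1]

    # Scala libraries carry the binary version after the last underscore.
    _, sep, suffix = artifact.rpartition("_")
    if not sep:
        raise ValueError(f"artifact {artifact!r} has no Scala suffix")

    # Compare major.minor only: patch releases share one binary version.
    cluster_binary = ".".join(cluster_scala.split(".")[:2])
    return suffix == cluster_binary
```

For example, `scala_suffix_matches("com.crealytics:spark-excel_2.12:3.2.1_0.16.4", "2.11")` is false, which is the situation described above. The cluster's Scala version is listed next to the Databricks Runtime version when you pick a runtime.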

Re: How to read excel file using databricks

LeiSun1992 — Tue, 19 Nov 2019 02:52:14 GMT

(1) Log in to your Databricks account, click Clusters, then double-click the cluster you want to work with.

(2) Click Libraries, then click Install New.

(3) Click Maven and, in Coordinates, paste this line

 com.crealytics:spark-excel_2.11:0.12.2

to install the library.

(4) After the library installation finishes, open a notebook and read the Excel file with the following code. It works!

val sparkDF = spark.read.format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("inferSchema", "true")
  .load("/mnt/lsTest/test.xlsx")
display(sparkDF)

Re: How to read excel file using databricks

LeiSun1992 — Tue, 19 Nov 2019 02:55:36 GMT

The library you are using is out of date.

You have to install the latest version.

(1) Log in to your Databricks account, click Clusters, then double-click the cluster you want to work with.

(2) Click Libraries, then click Install New.

(3) Click Maven and, in Coordinates, paste this line

com.crealytics:spark-excel_2.11:0.12.2

to install the library.

Re: How to read excel file using databricks

SakthivelNachim — Sun, 23 Feb 2020 13:46:04 GMT

This works as expected with the com.crealytics:spark-excel_2.11:0.12.5 library.

val df_excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(file_path)
display(df_excel)

Re: How to read excel file using databricks

PrekshaPunwani — Wed, 22 Jul 2020 07:32:47 GMT

Dropping the ".xlsx" from the file path worked for me!

Re: How to read excel file using databricks

edwards142 — Fri, 11 Dec 2020 08:37:45 GMT

Don’t worry you have several other options to open Excel file without Excel. Here are those options, so please check it out..!

http://www.repairmsexcel.com/blog/open-excel-files-without-excel

Re: How to read excel file using databricks

Devarsh — Tue, 17 May 2022 14:00:38 GMT

First of all, check your Spark and Scala versions.

Then install the library using the Maven coordinates that match your Spark and Scala versions.

See this link for the Maven coordinates to use:

https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12

Select the cluster --> Libraries --> Install New --> Maven -->

Coordinates: com.crealytics:spark-excel_2.12:3.2.1_0.16.4

For PySpark, use the following code:

df2 = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("dbfs:/FileStore/shared_uploads/abc@gmail.com/book.xlsx")
)
display(df2)

Re: How to read excel file using databricks

Anonymous — Sat, 19 Nov 2022 10:16:24 GMT

Another option that may help in your case is to use pandas to read the Excel file and then convert the pandas DataFrame to a PySpark DataFrame 🙂
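A minimal sketch of the pandas route, assuming the file is on DBFS and the cluster exposes DBFS under the local `/dbfs` mount. pandas uses local file APIs, so it needs a `/dbfs/...` path rather than a `dbfs:/` URI. The helper names are illustrative, and reading `.xlsx` with pandas requires the `openpyxl` package on the cluster. The whole workbook is loaded on the driver, so this suits small files.

```python
def to_local_path(path: str) -> str:
    """Map a DBFS path to the /dbfs mount used by local file APIs."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    if path.startswith("/dbfs/"):
        return path  # already a local-mount path
    return "/dbfs/" + path.lstrip("/")


def read_excel_via_pandas(spark, path: str, sheet_name=0):
    """Read one sheet with pandas and convert it to a Spark DataFrame.

    `spark` is the SparkSession that Databricks notebooks provide.
    """
    # Imported here so that to_local_path can be used without pandas installed.
    import pandas as pd

    pdf = pd.read_excel(to_local_path(path), sheet_name=sheet_name)
    # Columns with mixed types may need cleaning before Spark can infer a schema.
    return spark.createDataFrame(pdf)
```

In a notebook this would be called as `df = read_excel_via_pandas(spark, "dbfs:/FileStore/tables/Airline.xlsx")`, followed by `display(df)`.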

Re: How to read excel file using databricks

Ananth — Wed, 30 Nov 2022 19:25:52 GMT

This really worked. However, I see this error for larger Excel files:

shadeio.poi.util.RecordFormatException: Tried to allocate an array of length 208,933,193, but the maximum length for this record type is 100,000,000.
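One thing to try for this error, offered as a sketch rather than a confirmed fix: the spark-excel README lists a `maxRowsInMemory` option that switches to a streaming reader, and a `maxByteArraySize` option that raises the Apache POI allocation limit named in the message. Both options are assumed to exist in the installed spark-excel version (older releases may not have them). The helper names and default values below are illustrative.

```python
def large_file_options(max_rows_in_memory=1000, max_byte_array_size=None):
    """Build reader options intended for large .xlsx files.

    Option names are taken from the spark-excel README; whether they are
    honored depends on the installed library version (assumption).
    """
    options = {
        "header": "true",
        # Streaming reader: keeps only this many rows in memory at a time.
        "maxRowsInMemory": str(max_rows_in_memory),
    }
    if max_byte_array_size is not None:
        # Raises POI's per-record byte array limit (100,000,000 in the error above).
        options["maxByteArraySize"] = str(max_byte_array_size)
    return options


def read_large_excel(spark, path, **kwargs):
    """Read an Excel file with the large-file options applied."""
    return (
        spark.read.format("com.crealytics.spark.excel")
        .options(**large_file_options(**kwargs))
        .load(path)
    )
```

For the message above, a call such as `read_large_excel(spark, path, max_byte_array_size=300_000_000)` would set the limit above the 208,933,193 bytes that the reader tried to allocate.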

Re: How to read excel file using databricks

Datab — Fri, 15 Sep 2023 05:44:06 GMT

No thanks

Re: How to read excel file using databricks

Gaurav_Databric — Fri, 15 Sep 2023 05:51:16 GMT

# Example: Show the first 5 rows of the DataFrame
df.head(5)

# For Scala
// Example: Show the first 5 rows of the DataFrame
df.show(5)

Step 7: Perform Data Visualization (Optional) If you wish to visualize the data, Databricks provides various plotting libraries and visualization tools to present your findings effectively.

Step 8: Save or Export Results (Optional) After performing your analysis, if you want to save the processed data or export the results, Databricks supports various formats such as Parquet, CSV, JSON, etc.