cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

read percentage values in spark ( no casting )

sarvesh
Contributor III

I have a xlsx file which has a single column ;

percentage

30%

40%

50%

-10%

0.00%

0%

0.10%

110%

99.99%

99.98%

-99.99%

-99.98%

when i read this using Apache-Spark out put i get is,

|percentage|

+----------+

| 0.3|

| 0.4|

| 0.5|

| -0.1|

| 0.0|

| 0.0|

| 0.001|

| 1.1|

| 0.9999|

| 0.9998|

+----------+

expected output is ,

+----------+

|percentage|

+----------+

| 30%|

| 40%|

| 50%|

| -10%|

| 0.00%|

| 0%|

| 0.10%|

| 110%|

| 99.99%|

| 99.98%|

+----------+

My code -

val spark = SparkSession

.builder

.appName("trimTest")

.master("local[*]")

.getOrCreate()

val df = spark.read

.format("com.crealytics.spark.excel").

option("header", "true").

option("maxRowsInMemory", 1000).

option("inferSchema", "true").

load("data/percentage.xlsx")

df.printSchema()

df.show(10)

I Don't want to use casting or turning inferschema to false, i want a way to read percentage value as percentage not as double or string.

4 REPLIES 4

Hubert-Dudek
Esteemed Contributor III

Output is rather correct as this is as percentage are in excel (what is seen in excel is just formatting of cells). In Spark the same 100% = 1.

If you want to display as percentage for example in dashboard you just need to concatenate % sign.

.withColumn("rate",(col("rate") * 100).cast("int"))
.withColumn("rate",concat((col("rate") * 100).cast("int"),lit('%')))

casting is not what i want suppose i get a big excel file with millions of rows, casting will make it super slow.

-werners-
Esteemed Contributor III

No necessarely. Millions of rows is not that much. For Excel it is, but not for Spark.

-werners-
Esteemed Contributor III

Affirmative. This is how excel stores percentages. What you see is just cell formatting.

Databricks notebooks do not (yet?) have the possibility to format the output.

But it is easy to use a BI tool on top of Databricks, where you can change the formatting.

And that is in my opinion how it should be done.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group