read percentage values in spark ( no casting )

sarvesh
Contributor III

I have an xlsx file which has a single column:

percentage
30%
40%
50%
-10%
0.00%
0%
0.10%
110%
99.99%
99.98%
-99.99%
-99.98%

When I read this using Apache Spark, the output I get is:

+----------+
|percentage|
+----------+
|       0.3|
|       0.4|
|       0.5|
|      -0.1|
|       0.0|
|       0.0|
|     0.001|
|       1.1|
|    0.9999|
|    0.9998|
+----------+

The expected output is:

+----------+
|percentage|
+----------+
|       30%|
|       40%|
|       50%|
|      -10%|
|     0.00%|
|        0%|
|     0.10%|
|      110%|
|    99.99%|
|    99.98%|
+----------+

My code -

val spark = SparkSession
  .builder
  .appName("trimTest")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 1000)
  .option("inferSchema", "true")
  .load("data/percentage.xlsx")

df.printSchema()
df.show(10)

I don't want to use casting or turn inferSchema to false; I want a way to read the percentage values as percentages, not as doubles or strings.

4 REPLIES

Hubert-Dudek
Esteemed Contributor III

The output is actually correct, because this is how percentages are stored in Excel (what you see in Excel is just cell formatting). In Spark, 100% is the same as 1.

If you want to display it as a percentage, for example in a dashboard, you just need to multiply by 100 and concatenate a % sign.

.withColumn("rate",(col("rate") * 100).cast("int"))
.withColumn("rate",concat((col("rate") * 100).cast("int"),lit('%')))

Casting is not what I want. Suppose I get a big Excel file with millions of rows; casting will make it super slow.

-werners-
Esteemed Contributor III

Not necessarily. Millions of rows is not that much. For Excel it is, but not for Spark.
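For what it's worth, one way to check that the extra formatting column is only a narrow projection on top of the file scan, not a second pass over the data, is to look at the plan Spark produces (a hypothetical check on my part, reusing the expression from the earlier reply):

// Nothing is computed until an action such as show() or a write is triggered;
// explain() just prints the physical plan Spark would run.
df.withColumn(
  "percentage",
  concat((col("percentage") * 100).cast("int"), lit("%"))
).explain()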

-werners-
Esteemed Contributor III

Affirmative. This is how Excel stores percentages. What you see is just cell formatting.

Databricks notebooks do not (yet?) have the possibility to format the output.

But it is easy to use a BI tool on top of Databricks, where you can change the formatting.

And that is in my opinion how it should be done.
