read percentage values in spark ( no casting )

sarvesh — Wed, 01 Dec 2021 13:11:00 GMT

I have an xlsx file with a single column:

percentage

30%

40%

50%

-10%

0.00%

0.10%

110%

99.99%

99.98%

-99.99%

-99.98%

When I read this using Apache Spark, the output I get is:

+----------+

|percentage|

+----------+

| 0.3|

| 0.4|

| 0.5|

| -0.1|

| 0.0|

| 0.001|

| 1.1|

| 0.9999|

| 0.9998|

+----------+

The expected output is:

+----------+

|percentage|

+----------+

| 30%|

| 40%|

| 50%|

| -10%|

| 0.00%|

| 0%|

| 0.10%|

| 110%|

| 99.99%|

| 99.98%|

+----------+

My code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("trimTest")
  .master("local[*]")
  .getOrCreate()

// Read the workbook with the spark-excel data source and let it infer column types
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("maxRowsInMemory", 1000)
  .option("inferSchema", "true")
  .load("data/percentage.xlsx")

df.printSchema()
df.show(10)

I don't want to use casting or set inferSchema to false. I want a way to read a percentage value as a percentage, not as a double or a string.

Re: read percentage values in spark ( no casting )

Hubert-Dudek — Wed, 01 Dec 2021 13:32:10 GMT

The output is correct: this is how percentages are stored in Excel (what you see in Excel is just cell formatting). Spark treats them the same way: 100% = 1.

If you want to display the value as a percentage, for example in a dashboard, you just need to multiply by 100 and concatenate a % sign.

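// Option 1: whole-number percentage as an int (the fractional part is truncated)
.withColumn("rate", (col("rate") * 100).cast("int"))
// Option 2, used instead of option 1: percentage as a string with a % sign
.withColumn("rate", concat((col("rate") * 100).cast("int"), lit("%")))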
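
Casting to int drops the decimals, so 99.99% would come out as 99%. A minimal sketch of a variant that keeps two decimal places is below. It assumes a column named percentage, as in the question, and uses a small in-memory DataFrame as a stand-in for the Excel file; the column name percentage_display is made up for illustration. Note that it always prints two decimals (30.00% rather than 30%), because the cell formatting from Excel is not available in Spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, format_number, lit}

val spark = SparkSession.builder
  .appName("percentDisplay")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame read from Excel: fractions, as Spark sees them
val df = Seq(0.3, -0.1, 0.001, 1.1, 0.9999).toDF("percentage")

// Keep the numeric column for calculations and add a display-only string column
val shown = df.withColumn(
  "percentage_display",
  concat(format_number(col("percentage") * 100, 2), lit("%"))
)

shown.show()
```

Keeping the original double column next to the display column means aggregations and comparisons still work on numbers, and the string is only used for presentation.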

Re: read percentage values in spark ( no casting )

-werners- — Wed, 01 Dec 2021 13:42:43 GMT

Affirmative. This is how Excel stores percentages. What you see is just cell formatting.

Databricks notebooks do not (yet?) have the possibility to format the output.

But it is easy to use a BI tool on top of Databricks, where you can change the formatting.

And that is in my opinion how it should be done.

Re: read percentage values in spark ( no casting )

sarvesh — Wed, 01 Dec 2021 13:51:12 GMT

Casting is not what I want. Suppose I get a big Excel file with millions of rows; casting will make it super slow.

Re: read percentage values in spark ( no casting )

-werners- — Wed, 01 Dec 2021 13:54:10 GMT

Not necessarily. Millions of rows is not that much. For Excel it is, but not for Spark.
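
One way to check this on your own machine is a rough timing sketch like the one below. It uses 5 million synthetic random fractions instead of a real Excel file, so it only measures the formatting step, not the Excel read; the row count, the seed, and the column names are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, format_number, lit, rand}

val spark = SparkSession.builder
  .appName("percentScale")
  .master("local[*]")
  .getOrCreate()

// Synthetic data: 5 million random fractions in [0, 1)
val big = spark.range(5000000L).select(rand(42).as("percentage"))

// Same per-row formatting as for a small file
val formatted = big.withColumn(
  "percentage_display",
  concat(format_number(col("percentage") * 100, 2), lit("%"))
)

// Print the physical plan to inspect how Spark will execute the formatting
formatted.explain()

// Filter on the formatted column so Spark has to compute it for every row
val start = System.nanoTime()
val n = formatted.filter(col("percentage_display").endsWith("%")).count()
val elapsedMs = (System.nanoTime() - start) / 1000000
println(s"formatted $n rows in $elapsedMs ms")
```

The filter is there on purpose: a plain count() could let the optimizer skip the unused display column, which would make the timing meaningless.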