spark.read excel with formula

Braxx — Mon, 10 Jan 2022 15:07:19 GMT

For some reason Spark is not reading the data correctly from an xlsx file in a column that contains a formula. I am reading the file from blob storage.

Consider this simple data set

The column "color" has formulas for all the cells like

=VLOOKUP(A4,C3:D5,2,0)

In cases where the formula cannot be calculated, Excel and Spark show different values:

Excel - #N/A

Spark - =VLOOKUP(A4,C3:D5,2,0)

Here is my code:

    # input_path and input_folder_general are defined elsewhere in the notebook
    df = (
        spark.read
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .load(input_path + input_folder_general + "test1.xlsx")
    )

    display(df)

And here is how the above dataset is read:

How do I get #N/A instead of a formula?

Re: spark.read excel with formula

-werners- — Mon, 10 Jan 2022 15:45:44 GMT

The formula itself is probably what is actually stored in the Excel file.

Excel translates this to #N/A.

The only related option I know of is setErrorCellsToFallbackValues, but I doubt it applies in your case.
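For reference, a minimal sketch of how that option could be passed to the reader from the question. As far as I understand the spark-excel README, setErrorCellsToFallbackValues defaults to "false" (error cells become null) and, when "true", error cells become the zero value of the column's type. Whether it has any effect on a cell that arrives as formula text is not confirmed here. The helper name read_excel and the excel_options dict are hypothetical, introduced only for illustration.

```python
# Options for the com.crealytics.spark.excel reader used in the question.
excel_options = {
    "header": "true",
    # Assumption based on the spark-excel README: with "true", error cells
    # are replaced by the zero value of the column type instead of null.
    "setErrorCellsToFallbackValues": "true",
}


def read_excel(spark, path, options=excel_options):
    """Hypothetical helper: load an xlsx file with the options above."""
    return (
        spark.read
        .format("com.crealytics.spark.excel")
        .options(**options)  # DataFrameReader.options accepts keyword pairs
        .load(path)
    )
```

In a notebook this would be called as read_excel(spark, input_path + input_folder_general + "test1.xlsx").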

You could use a matching function (a regexp, for example) to determine whether a row contains actual output or a formula.
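A rough sketch of that matching idea in plain Python, assuming the only reliable signal is that an unresolved formula arrives as a string starting with "=". The names looks_like_formula and formula_to_na are hypothetical, and a legitimate text value that happens to begin with "=" would be misclassified.

```python
import re

# Assumption: a cell read as a formula starts with "=" (optionally after
# whitespace); nothing about the rest of the formula syntax is assumed.
FORMULA_PATTERN = re.compile(r"^\s*=")


def looks_like_formula(value):
    """Return True if the cell value looks like raw formula text."""
    return isinstance(value, str) and FORMULA_PATTERN.match(value) is not None


def formula_to_na(value, placeholder="#N/A"):
    """Replace raw formula text with a placeholder; keep other values as-is."""
    return placeholder if looks_like_formula(value) else value


# Sample values modeled on the "color" column from the question.
colors = ["red", "blue", "=VLOOKUP(A4,C3:D5,2,0)"]
cleaned = [formula_to_na(c) for c in colors]
```

On a DataFrame, the same check could probably be written with a column expression such as F.when(F.col("color").rlike(r"^\s*="), "#N/A").otherwise(F.col("color")), with pyspark.sql.functions imported as F.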

Re: spark.read excel with formula

Braxx — Tue, 11 Jan 2022 10:21:00 GMT

Actually, the formula is underneath all the "color" values. Red and blue are the results of a formula and are displayed correctly. The issue occurs only when the formula cannot calculate a value.

Is there any way to read only the results of the formulas, so that #N/A is read as #N/A and not as the formula itself?

Using regexp is risky as I have no guarantee the formula's syntax will have the same pattern.

Re: spark.read excel with formula

-werners- — Tue, 11 Jan 2022 10:24:56 GMT

Spark will just consume what you throw at it; it cannot interpret Excel formulas.

So the way to go is to make sure your formula always resolves.