Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading Excel files with Spark returns formula values instead of computed values

bts136
New Contributor

Hi,

I'm seeing inconsistent behavior when reading Excel files with the built-in Lakeflow Connect Excel reader, spark.read.format("excel") (doc: https://docs.databricks.com/aws/en/query/formats/excel). A few days ago I read an .xlsx file from S3 this way, and it returned computed/numeric values.

Today, running the exact same code against the exact same file path in S3 returns Excel formulas as strings instead (e.g., "=($B59+(K34/4.3))/K16"). The file has not been modified or re-uploaded in between. 

Has anyone experienced similar behavior, or knows what could cause this? Could it be because the reader is still in beta?

Appreciate any help!

2 REPLIES

bianca_unifeye
Databricks MVP

Databricks' built-in Excel reader is in Beta and is documented to ingest formulas as their computed values. https://docs.databricks.com/aws/en/query/formats/excel

In practice, Excel files store both the formula and (optionally) a cached calculated result. Many readers (Apache POI under the hood, for example) do not recalculate formulas: they read the cached result if it exists, and otherwise may fall back to returning the formula string.
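If openpyxl is available, you can check directly whether a workbook carries cached results by loading it twice, once for formulas and once for cached values, and comparing. A minimal sketch (the workbook built inline is a stand-in for your real file):

```python
import io

import openpyxl


def has_cached_results(xlsx_bytes: bytes) -> bool:
    """Return True if every formula cell also carries a cached value."""
    # data_only=False yields formula strings; data_only=True yields the
    # cached results (or None when a formula was never calculated).
    wb_formulas = openpyxl.load_workbook(io.BytesIO(xlsx_bytes), data_only=False)
    wb_values = openpyxl.load_workbook(io.BytesIO(xlsx_bytes), data_only=True)
    for ws_f, ws_v in zip(wb_formulas.worksheets, wb_values.worksheets):
        for row_f, row_v in zip(ws_f.iter_rows(), ws_v.iter_rows()):
            for cell_f, cell_v in zip(row_f, row_v):
                is_formula = isinstance(cell_f.value, str) and cell_f.value.startswith("=")
                if is_formula and cell_v.value is None:
                    return False
    return True


# Demo: a workbook written by openpyxl stores the formula but no cached
# result, because openpyxl has no calculation engine.
wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = "=A1*2"
buf = io.BytesIO()
wb.save(buf)
print(has_cached_results(buf.getvalue()))  # False
```

Running this check on the problem file would tell you whether the reader ever had a cached value to return.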

Why it can look “inconsistent” even if the S3 file didn’t change

  • The most common reason is actually a change in runtime/connector version or execution environment (e.g., different DBR, different cluster policy, preview/feature rollout), which can change the fallback behaviour in a Beta feature. 

How to fix / mitigate

  1. Verify the DBR version & cluster configuration are identical between runs (Excel support requires DBR 17.1+)

  2. Ensure the workbook has fresh cached results:

    • Open the file in Excel (or equivalent), force recalculation, then Save (this writes cached results).

    • If you can’t rely on cached results, export to CSV (values-only) upstream.

  3. If you must evaluate formulas server-side, you’ll need a different approach (e.g., pre-processing outside Spark), because Spark/Databricks isn’t an Excel calculation engine.
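For the values-only CSV export suggested in step 2, a minimal pandas sketch (a sample workbook is created inline so the snippet is self-contained; in practice src/dst would be your S3 or DBFS paths):

```python
import os
import tempfile

import openpyxl
import pandas as pd

# Build a small sample workbook; substitute your real file in practice.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "report.xlsx")
dst = os.path.join(tmpdir, "report.csv")

wb = openpyxl.Workbook()
wb.active.append(["region", "sales"])
wb.active.append(["emea", 100])
wb.save(src)

# pandas reads .xlsx via openpyxl using cached values only, so formula
# cells without cached results arrive as NaN -- an easy red flag to check
# for before the data reaches Spark.
df = pd.read_excel(src, engine="openpyxl")
df.to_csv(dst, index=False)
print(open(dst).read())
```

An upstream job running this conversion guarantees Spark only ever sees plain values, never formula strings.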

SteveOstrowski
Databricks Employee

Hi @bts136,

This behavior is related to how Excel files store formula results internally, and it is something you can work around.

BACKGROUND: HOW EXCEL STORES FORMULAS

Excel files (.xlsx) store both the formula text and a cached computed result for each formula cell. When you open a file in Excel and save it, Excel evaluates all formulas and writes those cached results into the file. The Databricks Excel reader (spark.read.format("excel")) reads those cached computed values, which is why the documentation states that "formulas are ingested as their computed values."

However, if the Excel file was generated or modified by a tool other than Excel (for example, a Python library like openpyxl, a reporting tool, or an automated export process), the cached formula results may not be present in the file. In that case, the reader has no computed value to return and falls back to reading the raw formula string instead.
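You can see this at the file level: an .xlsx file is a zip archive, and each formula cell stores its formula text in an <f> element and its cached result, when present, in a <v> element. A small demonstration using openpyxl (which never computes results):

```python
import io
import zipfile

import openpyxl

# Build a formula cell with openpyxl; it has no calculation engine,
# so no cached result is written alongside the formula.
wb = openpyxl.Workbook()
wb.active["A1"] = 41
wb.active["A2"] = "=A1+1"
buf = io.BytesIO()
wb.save(buf)

# Inspect the raw sheet XML inside the zip container.
with zipfile.ZipFile(buf) as zf:
    sheet_xml = zf.read("xl/worksheets/sheet1.xml").decode()

print("<f>" in sheet_xml)        # True: the formula text is stored
print("<v>42</v>" in sheet_xml)  # False: no cached result for A2 was written
```

A file saved from Excel itself would contain both elements, which is exactly what the Databricks reader needs.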

WHY THE BEHAVIOR CHANGED BETWEEN RUNS

A few scenarios could explain why you saw computed values on one run and formula strings on another:

1. The file was replaced or overwritten at the same S3 path by an upstream process that generates files without cached formula results. Even though the path is the same, the file content may differ.

2. The original file was saved from Excel (with cached results), then later re-exported or regenerated by an automated tool that does not cache formula results.

3. If you are using Auto Loader or a pipeline that picks up new versions of the file, a newly landed version of the file could have different caching behavior.
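One way to rule scenarios 1-3 in or out is to fingerprint the object's bytes between runs: if the hash changes, the file at that path was replaced even though the path stayed the same. A sketch (the boto3 call is illustrative and assumes the usual credentials setup):

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    # A content hash distinguishes "same path" from "same bytes": if the
    # fingerprint differs between runs, an upstream process replaced the file.
    return hashlib.sha256(data).hexdigest()


print(file_fingerprint(b"abc"))

# On Databricks you could fingerprint the S3 object like this:
# import boto3
# obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="reports/file.xlsx")
# print(file_fingerprint(obj["Body"].read()))
```

S3's ETag/LastModified metadata can serve the same purpose, but a content hash is unambiguous even for multipart uploads.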

HOW TO FIX THIS

Option 1: Re-save the file from Excel

Open the .xlsx file in Microsoft Excel (or Google Sheets, then export as .xlsx). This forces all formulas to be evaluated and their results cached. Save and re-upload to S3. This is the simplest fix if you have a small number of files.

Option 2: Pre-process with openpyxl in Python

If the files are generated programmatically, you can add a step that evaluates and caches formula results before reading with Spark. Here is an example using openpyxl:

import openpyxl

# data_only=True returns cached computed values for formula cells.
wb = openpyxl.load_workbook("/dbfs/path/to/file.xlsx", data_only=True)
ws = wb.active

# If a formula cell has no cached result, its value is None here;
# in that case you need a formula evaluator (see Option 3).
print(ws["A1"].value)

Note that openpyxl's data_only=True mode reads cached values but does not evaluate formulas itself. If cached values are missing, cells will return None.

Option 3: Use a formula evaluation library

For a more robust approach, you can use a library that actually evaluates Excel formulas. The formulas library (pip install formulas) can do this:

import formulas

# Parse the workbook and compile its formula graph.
xl_model = formulas.ExcelModel().loads("/dbfs/path/to/file.xlsx").finish()
# Evaluate every formula in the model.
solution = xl_model.calculate()
# Write a copy of the workbook with the computed results filled in.
xl_model.write(dirpath="/dbfs/path/to/output/")

This evaluates all formulas and writes a new file with cached results that the Databricks Excel reader can then read correctly.

Option 4: Read with pandas as an intermediate step

pandas uses openpyxl under the hood and reads only cached formula results, so formula cells without a cached result come back as NaN rather than as formula strings, which makes missing results easy to spot. You can read with pandas first, then convert to a Spark DataFrame:

import pandas as pd

pdf = pd.read_excel("/dbfs/path/to/file.xlsx", engine="openpyxl")
df = spark.createDataFrame(pdf)

Note that this approach works well for smaller files but may not scale for very large Excel files since it loads everything into driver memory.

Option 5: Fix the upstream file generation

If you control the process that creates the Excel file, the best long-term fix is to ensure that process writes cached formula results. For example, if using openpyxl to generate the file, you can set the calcMode property or use a library that supports formula evaluation during write.
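As a sketch of that idea with openpyxl: the library cannot evaluate formulas itself, but setting the workbook's fullCalcOnLoad calculation property asks whatever spreadsheet application next opens the file to recalculate everything, so a single open-and-save in Excel restores all cached results.

```python
import io

import openpyxl

wb = openpyxl.Workbook()
ws = wb.active
ws["A1"] = 10
ws["A2"] = "=A1*2"

# openpyxl cannot compute "=A1*2", but fullCalcOnLoad flags the workbook so
# the next application that opens it recalculates (and caches) every formula.
wb.calculation.fullCalcOnLoad = True

buf = io.BytesIO()
wb.save(buf)
```

This does not help a reader that only sees the raw file, so it is a complement to, not a substitute for, writing cached results upstream.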

ABOUT THE BETA STATUS

The Excel reader is currently in beta (available with Databricks Runtime 17.1 and above). The formula caching behavior described above is inherent to the Excel file format itself rather than a beta limitation of the reader. The reader correctly picks up cached formula results when they are present in the file.

DOCUMENTATION REFERENCE

For more details on the Excel reader and its options, see:
https://docs.databricks.com/aws/en/query/formats/excel.html

I hope this helps clarify the behavior. Let me know if you have questions about your specific file generation workflow and I can help narrow down the best approach.

* This reply used an agent system I built to research and draft the response, based on the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor system reliability, and update the draft when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.