Hi @bts136,
This behavior is related to how Excel files store formula results internally, and it is something you can work around.
BACKGROUND: HOW EXCEL STORES FORMULAS
Excel files (.xlsx) store both the formula text and a cached computed result for each formula cell. When you open a file in Excel and save it, Excel evaluates all formulas and writes those cached results into the file. The Databricks Excel reader (spark.read.format("excel")) reads those cached computed values, which is why the documentation states that "formulas are ingested as their computed values."
However, if the Excel file was generated or modified by a tool other than Excel (for example, a Python library like openpyxl, a reporting tool, or an automated export process), the cached formula results may not be present in the file. In that case, the reader has no computed value to return and falls back to reading the raw formula string instead.
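If you want to confirm whether a particular file actually contains cached results, you can inspect the worksheet XML directly, since an .xlsx file is just a zip archive: each formula cell stores its formula in an `<f>` element and its cached result in a sibling `<v>` element. Here is a rough stdlib-only diagnostic (the regex scan is a quick check, not a full OOXML parser, and the function name is my own):

```python
import re
import zipfile

def formula_cells_without_cached_values(xlsx_path):
    """Count formula cells (<f>) in the worksheet XML that have no
    cached result (<v>). An .xlsx file is a zip archive, so we can
    read the sheet XML with the standard library alone."""
    missing = 0
    with zipfile.ZipFile(xlsx_path) as zf:
        for name in zf.namelist():
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                xml = zf.read(name).decode("utf-8")
                # Each <c ...>...</c> element is one cell; this simple
                # pattern skips self-closing/attribute-free cells.
                for cell in re.findall(r"<c [^>]*>.*?</c>", xml, re.DOTALL):
                    if "<f" in cell and "<v>" not in cell:
                        missing += 1
    return missing
```

If this returns a nonzero count for the file that produced formula strings, that confirms the cached results are simply absent from the file.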
WHY THE BEHAVIOR CHANGED BETWEEN RUNS
A few scenarios could explain why you saw computed values on one run and formula strings on another:
1. The file was replaced or overwritten at the same S3 path by an upstream process that generates files without cached formula results. Even though the path is the same, the file content may differ.
2. The original file was saved from Excel (with cached results), then later re-exported or regenerated by an automated tool that does not cache formula results.
3. If you are using Auto Loader or a pipeline that picks up new versions of the file, a newly landed version of the file could have different caching behavior.
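One way to rule scenario 1 in or out is to log a checksum of the file alongside each read; if the checksum changes between runs, the bytes at that path changed even though the path did not. A minimal sketch using only the standard library (the path is illustrative):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute an MD5 checksum of a file in chunks, so large Excel
    files do not need to fit in memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Log this next to each read; a changed digest means a changed file.
# print(file_md5("/dbfs/path/to/file.xlsx"))
```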
HOW TO FIX THIS
Option 1: Re-save the file from Excel
Open the .xlsx file in Microsoft Excel (or Google Sheets, then export as .xlsx). This forces all formulas to be evaluated and their results cached. Save and re-upload to S3. This is the simplest fix if you have a small number of files.
Option 2: Pre-process with openpyxl in Python
If the files are generated programmatically, you can add a step that evaluates and caches formula results before reading with Spark. Here is an example using openpyxl:
import openpyxl

path = "/dbfs/path/to/file.xlsx"
# data_only=True returns cached results; formula cells with no cache read as None.
values = openpyxl.load_workbook(path, data_only=True).active
formulas_ws = openpyxl.load_workbook(path).active  # default load keeps formula strings
uncached = [c.coordinate for row in formulas_ws.iter_rows() for c in row
            if c.data_type == "f" and values[c.coordinate].value is None]
print("Formula cells missing cached results:", uncached)
Note that openpyxl's data_only=True mode reads cached values but does not evaluate formulas itself. If cached values are missing, cells will return None.
Option 3: Use a formula evaluation library
For a more robust approach, you can use a library that actually evaluates Excel formulas. The formulas library (pip install formulas) can do this:
import formulas

# Load the workbook, evaluate every formula, and write a copy whose
# formula cells carry cached results.
xl_model = formulas.ExcelModel().loads("/dbfs/path/to/file.xlsx").finish()
xl_model.calculate()
xl_model.write(dirpath="/dbfs/path/to/output/")
This evaluates all formulas and writes a new file with cached results that the Databricks Excel reader can then read correctly.
Option 4: Read with pandas as an intermediate step
pandas also uses openpyxl under the hood, so it reads the same cached values and does not evaluate formulas either. It is still a useful intermediate step for quickly inspecting what values are actually stored in the file before converting to a Spark DataFrame:
import pandas as pd
pdf = pd.read_excel("/dbfs/path/to/file.xlsx", engine="openpyxl")
df = spark.createDataFrame(pdf)
Note that this approach works well for smaller files but may not scale for very large Excel files since it loads everything into driver memory.
Option 5: Fix the upstream file generation
If you control the process that creates the Excel file, the best long-term fix is to ensure that process writes cached formula results alongside the formula text. Note that calculation properties such as calcMode only instruct Excel how to recalculate when it opens the file, so they do not help a non-Excel reader; instead, compute the value in the generating code and store it with the formula, or run a formula-evaluation step (like the formulas library above) as part of the export.
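As a concrete example of storing the computed result with the formula: XlsxWriter's write_formula() accepts an optional final value argument, which is written into the file as the cached result. A minimal sketch (the filename and cell layout are illustrative):

```python
# Generate a workbook whose formula cell carries a cached result,
# so readers that rely on cached values see 42 rather than the
# formula string. Requires the third-party xlsxwriter package.
import xlsxwriter

workbook = xlsxwriter.Workbook("report.xlsx")
worksheet = workbook.add_worksheet()
worksheet.write_number("A1", 10)
worksheet.write_number("A2", 32)
# The optional last argument is the cached result stored in the
# file alongside the formula text.
worksheet.write_formula("A3", "=A1+A2", None, 42)
workbook.close()
```

The caveat is that your generating code must compute the value itself (XlsxWriter does not evaluate formulas), but for files produced programmatically the values are usually already known at write time.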
ABOUT THE BETA STATUS
The Excel reader is currently in beta (available with Databricks Runtime 17.1 and above). The formula caching behavior described above is inherent to the Excel file format itself rather than a beta limitation of the reader. The reader correctly picks up cached formula results when they are present in the file.
DOCUMENTATION REFERENCE
For more details on the Excel reader and its options, see:
https://docs.databricks.com/aws/en/query/formats/excel.html
I hope this helps clarify the behavior. Let me know if you have questions about your specific file generation workflow and I can help narrow down the best approach.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.