Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Using spark.read.excel – dataAddress with only start cell is not working (DBR 17.x)

DhivyaKeerthana
New Contributor III
Hi, has anyone successfully used the Databricks Runtime 17.x native Excel reader with a dataAddress containing only a start cell (no end cell)? Even in the documentation, it is not specified (https://learn.microsoft.com/en-us/azure/databricks/query/formats/excel).
 
Below code is not working
df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A7")
    .load(filepath))
 
In most scenarios, we don’t know the end cell address because both the number of rows and columns change for every file. How is the native Excel reader expected to help in these cases if dataAddress with only a start address (e.g. Sheet1!A7) does not automatically expand to the bottom‑right non‑empty cell?
 
 
4 REPLIES

saurabh18cs
Honored Contributor III

Hi @DhivyaKeerthana 

No, Databricks Runtime 17.x does not support dataAddress with only a start cell (e.g., "Sheet1!A7"). How about a full sheet reference instead, e.g. "Sheet1"?
 
Br

@saurabh18cs Thanks for the response. My data header row is located at A7, so reading the full sheet unfortunately does not help.

SteveOstrowski
Databricks Employee

Hi @DhivyaKeerthana,

Welcome to the community! You are correct that the native Excel reader (DBR 17.x) does not currently support a start-cell-only dataAddress like "Sheet1!A7". The documented dataAddress formats are:

- A full range: "Sheet1!C5:H10"
- A sheet name only: "Sheet1"
- Omitted entirely (reads all data from the first sheet)

Since your header row starts at A7 and the number of rows and columns varies per file, here are a couple of approaches that should work well for you.


OPTION 1: USE A LARGE BOUNDING RANGE

You can specify a range with a deliberately large end cell. The reader will only read up to the bottom-right non-empty cell within that range, so empty cells beyond your data will not cause problems:

df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A7:ZZ1000000")
    .load(filepath))

This tells the reader to start at A7 and look for data up to column ZZ, row 1,000,000. In practice it will stop at the last non-empty cell, so the variable row/column count is handled automatically. This is the simplest approach and should work directly with your use case.
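As a quick sanity check on those bounds (this helper is just an illustration, not part of any Databricks API): Excel tops out at column XFD and row 1,048,576, so an end cell of ZZ1000000 is always a valid reference.

```python
def col_to_index(col: str) -> int:
    """Convert an Excel column label (A, B, ..., Z, AA, AB, ...) to a 1-based index."""
    idx = 0
    for ch in col.upper():
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx

# ZZ is column 702; Excel's hard limit is XFD (16,384) by row 1,048,576,
# so a bounding range ending at ZZ1000000 stays comfortably within limits.
print(col_to_index("ZZ"))   # 702
print(col_to_index("XFD"))  # 16384
```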


OPTION 2: READ THE FULL SHEET AND SKIP ROWS PROGRAMMATICALLY

If you prefer more control, you can read the entire sheet without headerRows, then trim the leading rows in PySpark:

# Read the entire sheet with no headers (all values as strings,
# in columns named _c0, _c1, ...)
raw_df = (spark.read.format("excel")
    .option("headerRows", 0)
    .option("dataAddress", "Sheet1")
    .load(filepath))

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Add a 1-based row index that preserves the original file order
w = Window.orderBy(F.monotonically_increasing_id())
indexed_df = raw_df.withColumn("_row_idx", F.row_number().over(w))

# Excel row 7 holds the header, so rows 8 onward hold the data.
# Selecting the header by index (rather than filtering on its values)
# avoids accidentally dropping data rows that happen to match it.
header = indexed_df.filter(F.col("_row_idx") == 7).drop("_row_idx").first()
data_df = indexed_df.filter(F.col("_row_idx") > 7).drop("_row_idx")

# Rename the default _c0, _c1, ... columns to the header values
for i, col_name in enumerate(header):
    data_df = data_df.withColumnRenamed(f"_c{i}", str(col_name))

This is more code but gives you full control over row skipping logic.


RECOMMENDED APPROACH

Option 1 (the large bounding range) is the cleanest solution and aligns with how the native reader works. The documentation confirms that the parser reads from the specified start cell to the "bottom-right non-empty cell," so providing a generous end boundary is a safe and effective pattern.
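If you apply this pattern across many files, you could wrap it in a small helper that turns a start-cell-only address into a bounded range. This function is purely illustrative; the name and defaults are mine, not part of any Databricks API:

```python
import re

def expand_start_cell(data_address: str, end_col: str = "ZZ", end_row: int = 1_000_000) -> str:
    """Turn a start-cell-only address like 'Sheet1!A7' into a bounded
    range like 'Sheet1!A7:ZZ1000000' that the native reader accepts."""
    sheet, _, cell = data_address.partition("!")
    if not re.fullmatch(r"[A-Za-z]{1,3}[1-9][0-9]*", cell):
        raise ValueError(f"Unexpected cell reference: {cell!r}")
    return f"{sheet}!{cell}:{end_col}{end_row}"

# Usage (inside a Spark session):
# df = (spark.read.format("excel")
#     .option("headerRows", 1)
#     .option("dataAddress", expand_start_cell("Sheet1!A7"))
#     .load(filepath))
```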

For reference, here is the documentation page for the native Excel reader:
https://learn.microsoft.com/en-us/azure/databricks/query/formats/excel

Hope this helps you move forward. Let us know how it goes!

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

DhivyaKeerthana
New Contributor III

Thanks @SteveOstrowski for the response. Yes, I am using Option 1 (the large bounding range) as a workaround.