Hi @DhivyaKeerthana,
Welcome to the community! You are correct that the native Excel reader (DBR 17.x) does not currently support a start-cell-only dataAddress like "Sheet1!A7". The documented dataAddress formats are:
- A full range: "Sheet1!C5:H10"
- A sheet name only: "Sheet1"
- Omitted entirely (reads all data from the first sheet)
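To make the supported forms concrete, here is a small illustrative sketch (the helper name is mine, not part of the reader's API) that produces only the documented formats and rejects the unsupported start-cell-only form:

```python
# Hypothetical helper (not part of the native reader's API) that builds only
# the documented dataAddress forms and rejects a start-cell-only address.
def build_data_address(sheet=None, cell_range=None):
    if cell_range is not None:
        if ":" not in cell_range:
            # e.g. "A7" alone: the native reader does not accept this
            raise ValueError("start-cell-only addresses are not supported")
        if sheet is None:
            raise ValueError("a range requires a sheet name")
        return f"{sheet}!{cell_range}"   # full range, e.g. "Sheet1!C5:H10"
    if sheet is not None:
        return sheet                     # sheet name only, e.g. "Sheet1"
    return None                          # omit the option to read the first sheet

print(build_data_address("Sheet1", "C5:H10"))  # Sheet1!C5:H10
```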
Since your header row starts at A7 and the number of rows and columns varies per file, here are a couple of approaches that should work well for you.
OPTION 1: USE A LARGE BOUNDING RANGE
You can specify a range with a deliberately large end cell. The reader will only read up to the bottom-right non-empty cell within that range, so empty cells beyond your data will not cause problems:
```python
df = (spark.read.format("excel")
    .option("headerRows", 1)
    .option("dataAddress", "Sheet1!A7:ZZ1000000")
    .load(filepath))
```
This tells the reader to start at A7 and look for data up to column ZZ, row 1,000,000. In practice it will stop at the last non-empty cell, so the variable row/column count is handled automatically. This is the simplest approach and should work directly with your use case.
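If you are wondering how wide that bound really is: Excel column labels behave like bijective base-26 numbers, so "ZZ" resolves to column 702, far more than most exports need. A quick plain-Python sketch (the function name is mine, purely illustrative):

```python
# Illustrative: convert an Excel column label (A, B, ..., Z, AA, ..., ZZ)
# to its 1-based column index. Shows why A7:ZZ1000000 is a generous bound.
def col_letters_to_index(letters):
    idx = 0
    for ch in letters.upper():
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx

print(col_letters_to_index("ZZ"))  # 702
```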
OPTION 2: READ THE FULL SHEET AND SKIP ROWS PROGRAMMATICALLY
If you prefer more control, you can read the entire sheet without headerRows, then trim the leading rows in PySpark:
```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read the entire sheet with no headers (columns come back as _c0, _c1, ...)
raw_df = (spark.read.format("excel")
    .option("headerRows", 0)
    .option("dataAddress", "Sheet1")
    .load(filepath))

# Add a 1-based row index that follows the original file order
w = Window.orderBy(F.monotonically_increasing_id())
indexed_df = raw_df.withColumn("_row_idx", F.row_number().over(w))

# Excel row 7 (your header) lands at row index 7; rows 1-6 are the preamble
header = indexed_df.filter(F.col("_row_idx") == 7).first()
data_df = indexed_df.filter(F.col("_row_idx") > 7).drop("_row_idx")

# Rename the generated _c0, _c1, ... columns using the header row's values
new_columns = [str(header[i]) for i in range(len(header) - 1)]  # last field is _row_idx
for i, col_name in enumerate(new_columns):
    data_df = data_df.withColumnRenamed(f"_c{i}", col_name)
```
This is more code but gives you full control over row skipping logic.
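To make the skip-and-promote logic concrete without spinning up a cluster, here is the same idea on plain Python lists (the helper name and sample rows are mine, purely illustrative):

```python
# Plain-Python sketch of the skip-and-promote logic, no Spark required.
# The helper name and the sample rows are illustrative, not part of any API.
def skip_and_promote_header(rows, header_row):
    header = [str(v) for v in rows[header_row - 1]]  # 1-based header row
    data = rows[header_row:]                         # everything below it
    return header, data

rows = [["preamble", None, None]] * 6 + [["id", "name", "qty"], [1, "a", 2], [2, "b", 3]]
header, data = skip_and_promote_header(rows, 7)
print(header)  # ['id', 'name', 'qty']
print(data)    # [[1, 'a', 2], [2, 'b', 3]]
```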
RECOMMENDED APPROACH
Option 1 (the large bounding range) is the cleanest solution and aligns with how the native reader works. The documentation confirms that the parser reads from the specified start cell to the "bottom-right non-empty cell," so providing a generous end boundary is a safe and effective pattern.
For reference, here is the documentation page for the native Excel reader:
https://learn.microsoft.com/en-us/azure/databricks/query/formats/excel
Hope this helps you move forward. Let us know how it goes!
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect any drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.