I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues.
My notebook does a straightforward read using the following:
lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs/ABC*/files/ABC*.xlsm'
sheet_name = 'export'
sheet_range = 'A1:Z2'
excel_df = (spark.read
.format("excel")
.option("databricks.connection", lakeflow_connection_name)
.option("headerRows", 1)
.option("inferSchema", False)
.option("dataAddress", f"{sheet_name}!{sheet_range}")
.load(sharepoint_site_url)
)
The path uses two wildcard levels: first to match a set of docs directories (ABC*), and then to match specific .xlsm files within a fixed subdirectory (files/ABC*.xlsm).
Our SharePoint site has around 5,000 directories, the vast majority of which contain no matching files. In our current dev environment there are only 10 files that satisfy the wildcard criteria, yet the load is consistently taking 40+ minutes to return them.
My assumption is that the connector is scanning all the files in my SharePoint, effectively making thousands of calls before a single file is read. Is that correct?
Has anyone found a way to speed this up? Are there any connector options, configuration settings, or recommended patterns for this kind of wildcard-heavy path? Any guidance would be appreciated.