cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks Standard SharePoint Connector Performance Issues

ConnorK
Databricks Partner

I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues.

My notebook does a straightforward read using the following:

lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs/ABC*/files/ABC*.xlsm'
sheet_name = 'export'
sheet_range = 'A1:Z2'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .load(sharepoint_site_url)
)

The path uses two wildcard levels: first to match a set of docs directories (ABC*), and then to match specific .xlsm files within a fixed subdirectory (files/ABC*.xlsm).

Our SharePoint site has around 5,000 directories, the vast majority of which contain no matching files. In our current dev environment there are only 10 files that satisfy the wildcard criteria, yet the load is consistently taking 40+ minutes to return them.

My assumption is that the connector is scanning all the files in my SharePoint, effectively making thousands of calls before a single file is read. Is that correct?

Has anyone found a way to speed this up? Are there any connector options, configuration settings, or recommended patterns for this kind of wildcard-heavy path? Any guidance would be appreciated.

1 REPLY 1

bala_sai
New Contributor

Yes I think the delay is likely coming from file discovery rather than reading the Excel files.

Even if only 10 files match in dev, Databricks still has to find them first. With "docs/ABC*/files/ABC*.xlsm", it can end up scanning a big chunk of the SharePoint folder before it gets to those 10 files. You can test it by pointing ".load()" to one known folder with one known file. If that comes back fast, then the issue is definitely the wildcard discovery.

You can try to avoid the multilevel wildcard if possible. Either point to a smaller fixed folder and use pathGlobFilter, or keep a small manifest of exact file URL's. If this runs regularly it is better to stage the files to cloud storage/UC Volume first and read from there instead of making SharePoint do the discovery every time.