cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks Standard SharePoint Connector Performance Issues

ConnorK
Databricks Partner

I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues.

My notebook does a straightforward read using the following:

lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs/ABC*/files/ABC*.xlsm'
sheet_name = 'export'
sheet_range = 'A1:Z2'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .load(sharepoint_site_url)
)

The path uses two wildcard levels: first to match a set of docs directories (ABC*), and then to match specific .xlsm files within a fixed subdirectory (files/ABC*.xlsm).

Our SharePoint site has around 5,000 directories, the vast majority of which contain no matching files. In our current dev environment there are only 10 files that satisfy the wildcard criteria, yet the load is consistently taking 40+ minutes to return them.

My assumption is that the connector is scanning all the files in my SharePoint, effectively making thousands of calls before a single file is read. Is that correct?

Has anyone found a way to speed this up? Are there any connector options, configuration settings, or recommended patterns for this kind of wildcard-heavy path? Any guidance would be appreciated.

3 REPLIES 3

bala_sai
New Contributor

Yes I think the delay is likely coming from file discovery rather than reading the Excel files.

Even if only 10 files match in dev, Databricks still has to find them first. With "docs/ABC*/files/ABC*.xlsm", it can end up scanning a big chunk of the SharePoint folder before it gets to those 10 files. You can test it by pointing ".load()" to one known folder with one known file. If that comes back fast, then the issue is definitely the wildcard discovery.

You can try to avoid the multilevel wildcard if possible. Either point to a smaller fixed folder and use pathGlobFilter, or keep a small manifest of exact file URL's. If this runs regularly it is better to stage the files to cloud storage/UC Volume first and read from there instead of making SharePoint do the discovery every time.

balajij8
Contributor III

The standard SharePoint connector doesn't support folder path-based filtering well. When you use wildcards in the path itself (ABC*/files/ABC*.xlsm), the connector has to enumerate directories at the SharePoint level to resolve the patterns leading to making many API calls across 5,000 directories.

  • You can use pathGlobFilter instead of path wildcards
lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .option("pathGlobFilter", "ABC*/files/ABC*.xlsm")  # Filter here
    .load(sharepoint_site_url)
)

pathGlobFilter filters files by name after the connector retrieves the file list and is generally more efficient than path-level wildcards

  • Be more specific with paths - If you know the specific ABC directory names, query them explicitly in separate reads and union the results
target_dirs = ['ABC001', 'ABC002', 'ABC003']  # directories
dfs = []

for dir_name in target_dirs:
    path = f'https://example.sharepoint.com/sites/example_site/docs/{dir_name}/files/{dir_name}*.xlsm'
    df = spark.read.format("excel")...load(path)
    Add Append df code & use​

 

Yogasathyandrun
New Contributor

I think your diagnosis is likely correct.

One thing that stands out is that you’re only reading A1:Z2 from each workbook. Given that the operation is still taking 40+ minutes, the bottleneck is unlikely to be the Excel parsing itself and more likely to be file discovery.

With ~5,000 directories and a multilevel wildcard (ABC*/files/ABC*.xlsm), the connector may be spending most of its time resolving the matching paths before it ever starts reading data.

I’d also be cautious about relying on pathGlobFilter here. Even if it helps narrow file selection, the expensive part appears to be discovering the files in the first place.

As a quick validation, I’d try reading a few known paths explicitly and compare the runtime. If that drops significantly, then wildcard resolution is likely the dominant cost, and a manifest-driven or staged ingestion pattern may be a better long-term approach.

Data Engineer | Apache Spark | Delta Lake | Databricks