Databricks Community

ConnorK · 3 weeks ago

I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues.

My notebook does a straightforward read using the following:

lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs/ABC*/files/ABC*.xlsm'
sheet_name = 'export'
sheet_range = 'A1:Z2'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .load(sharepoint_site_url)
)

The path uses two wildcard levels: first to match a set of docs directories (ABC*), and then to match specific .xlsm files within a fixed subdirectory (files/ABC*.xlsm).

Our SharePoint site has around 5,000 directories, the vast majority of which contain no matching files. In our current dev environment there are only 10 files that satisfy the wildcard criteria, yet the load is consistently taking 40+ minutes to return them.

My assumption is that the connector is scanning all the files in my SharePoint, effectively making thousands of calls before a single file is read. Is that correct?

Has anyone found a way to speed this up? Are there any connector options, configuration settings, or recommended patterns for this kind of wildcard-heavy path? Any guidance would be appreciated.

bala_sai · 3 weeks ago

Yes I think the delay is likely coming from file discovery rather than reading the Excel files.

Even if only 10 files match in dev, Databricks still has to find them first. With "docs/ABC*/files/ABC*.xlsm", it can end up scanning a big chunk of the SharePoint folder before it gets to those 10 files. You can test it by pointing ".load()" to one known folder with one known file. If that comes back fast, then the issue is definitely the wildcard discovery.

You can try to avoid the multilevel wildcard if possible. Either point to a smaller fixed folder and use pathGlobFilter, or keep a small manifest of exact file URL's. If this runs regularly it is better to stage the files to cloud storage/UC Volume first and read from there instead of making SharePoint do the discovery every time.

balajij8 · 3 weeks ago

The standard SharePoint connector doesn't support folder path-based filtering well. When you use wildcards in the path itself (ABC*/files/ABC*.xlsm), the connector has to enumerate directories at the SharePoint level to resolve the patterns leading to making many API calls across 5,000 directories.

You can use pathGlobFilter instead of path wildcards

lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .option("pathGlobFilter", "ABC*/files/ABC*.xlsm")  # Filter here
    .load(sharepoint_site_url)
)

pathGlobFilter filters files by name after the connector retrieves the file list and is generally more efficient than path-level wildcards

Be more specific with paths - If you know the specific ABC directory names, query them explicitly in separate reads and union the results

target_dirs = ['ABC001', 'ABC002', 'ABC003']  # directories
dfs = []

for dir_name in target_dirs:
    path = f'https://example.sharepoint.com/sites/example_site/docs/{dir_name}/files/{dir_name}*.xlsm'
    df = spark.read.format("excel")...load(path)
    Add Append df code & use

Yogasathyandrun · 3 weeks ago

I think your diagnosis is likely correct.

One thing that stands out is that you’re only reading A1:Z2 from each workbook. Given that the operation is still taking 40+ minutes, the bottleneck is unlikely to be the Excel parsing itself and more likely to be file discovery.

With ~5,000 directories and a multilevel wildcard (ABC*/files/ABC*.xlsm), the connector may be spending most of its time resolving the matching paths before it ever starts reading data.

I’d also be cautious about relying on pathGlobFilter here. Even if it helps narrow file selection, the expensive part appears to be discovering the files in the first place.

As a quick validation, I’d try reading a few known paths explicitly and compare the runtime. If that drops significantly, then wildcard resolution is likely the dominant cost, and a manifest-driven or staged ingestion pattern may be a better long-term approach.

Data Engineer | Apache Spark | Delta Lake | Databricks

ConnorK · Wednesday

Hi all, thanks for the help.

Unfortunately pathGlobFilter didn't improve performance for me.

I ended up working around it with some pre-processing to figure out which specific folders I actually needed, then looping through those directories with the connector and combining the results into one dataframe. Each query now runs almost instantly, so the workaround does the job. Still no real fix for the original wildcard approach though.

Databricks Community

Databricks Standard SharePoint Connector Performance Issues

📌‌ Complete Your Profile – Help Others Get to Know You

The Next Wave of Enterprise AI | Webinar

Build apps without jumping through hoops

Solution Accelerator Series | Identify Fraud With Geospatial Analytics and AI

🌟 Community Pulse: Your Weekly Roundup! June 29 – July 05, 2026