Re: Databricks Standard SharePoint Connector Perfo...

bala_sai · Sunday

Yes, I think the delay is most likely coming from file discovery rather than from reading the Excel files.

Even if only 10 files match in dev, Databricks still has to find them first. With "docs/ABC*/files/ABC*.xlsm", it can end up scanning a big chunk of the SharePoint folder tree before it gets to those 10 files. You can test this by pointing ".load()" at one known folder containing one known file. If that comes back fast, then the wildcard discovery is almost certainly the bottleneck.
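If it helps, here is a rough sketch of how that comparison could be timed. The helper only wraps a callable with a timer, so it makes no assumptions about the connector itself. The names "timed", "reader", "ONE_KNOWN_FILE_URL" and "WILDCARD_URL" are placeholders I made up, not anything from your notebook or from the connector.

```python
import time


def timed(label, load_fn):
    """Run load_fn once, print the elapsed time, and return (result, seconds)."""
    start = time.perf_counter()
    result = load_fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


# Hypothetical usage in a notebook, assuming `reader` is your already
# configured DataFrameReader for SharePoint (format and options set):
#
#   _, t_one = timed("one known file", lambda: reader.load(ONE_KNOWN_FILE_URL).count())
#   _, t_glob = timed("wildcard path", lambda: reader.load(WILDCARD_URL).count())
#
# Calling .count() forces Spark to actually do the work, so lazily
# deferred steps are included in the measurement.
```

If the single-file call is quick and the wildcard call is slow for the same small number of files, that points at discovery rather than Excel parsing.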

Try to avoid the multi-level wildcard if possible. Either point to a smaller fixed folder and use "pathGlobFilter", or keep a small manifest of exact file URLs. If this runs regularly, it is better to stage the files to cloud storage or a UC Volume first and read from there, instead of making SharePoint do the discovery on every run.
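For the manifest option, a minimal sketch could look like the one below. It assumes the manifest is just a list of exact file URLs, one per line, with blank lines and "#" comments allowed. The function name "urls_from_manifest" and the example.sharepoint.com URLs are made up for illustration, and I have not verified that the SharePoint connector accepts a list of paths in ".load()", so treat that last step as something to test.

```python
from fnmatch import fnmatch


def urls_from_manifest(lines, name_pattern="ABC*.xlsm"):
    """Return the exact file URLs whose file name matches name_pattern.

    `lines` is any iterable of strings (for example an open text file).
    Blank lines and lines starting with '#' are skipped.
    """
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        # Match on the last path segment only, so no folder scan is implied.
        file_name = url.rsplit("/", 1)[-1]
        if fnmatch(file_name, name_pattern):
            urls.append(url)
    return urls


# Made-up manifest content, for illustration only.
manifest = [
    "# exact files to read",
    "https://example.sharepoint.com/sites/demo/docs/ABC1/files/ABC1.xlsm",
    "https://example.sharepoint.com/sites/demo/docs/ABC1/files/notes.txt",
    "",
    "https://example.sharepoint.com/sites/demo/docs/ABC2/files/ABC2.xlsm",
]
selected = urls_from_manifest(manifest)

# Hypothetical next step (untested against the SharePoint connector):
#   df = reader.load(selected)
```

The point of the design is that the list of files is decided ahead of time, so nothing has to be listed or pattern-matched on the SharePoint side at read time.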