<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Databricks Standard SharePoint Connector Performance Issues in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160011#M54845</link>
    <description>&lt;P&gt;I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues.&lt;/P&gt;&lt;P&gt;My notebook does a straightforward read using the following:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs/ABC*/files/ABC*.xlsm'
sheet_name = 'export'
sheet_range = 'A1:Z2'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .load(sharepoint_site_url)
)&lt;/LI-CODE&gt;&lt;P&gt;The path uses two wildcard levels: first to match a set of docs directories (ABC*), and then to match specific .xlsm files within a fixed subdirectory (files/ABC*.xlsm).&lt;/P&gt;&lt;P&gt;Our SharePoint site has around 5,000 directories, the vast majority of which contain no matching files. In our current dev environment there are only 10 files that satisfy the wildcard criteria, yet the load is consistently taking 40+ minutes to return them.&lt;/P&gt;&lt;P&gt;My assumption is that the connector is scanning all the files in my SharePoint, effectively making thousands of calls before a single file is read. Is that correct?&lt;/P&gt;&lt;P&gt;Has anyone found a way to speed this up? Are there any connector options, configuration settings, or recommended patterns for this kind of wildcard-heavy path? Any guidance would be appreciated.&lt;/P&gt;</description>
    <pubDate>Sun, 21 Jun 2026 21:15:30 GMT</pubDate>
    <dc:creator>ConnorK</dc:creator>
    <dc:date>2026-06-21T21:15:30Z</dc:date>
    <item>
      <title>Databricks Standard SharePoint Connector Performance Issues</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160011#M54845</link>
      <description>&lt;P&gt;I've recently started using the Databricks Standard SharePoint connector within my workspace and have run into some significant performance issues.&lt;/P&gt;&lt;P&gt;My notebook does a straightforward read using the following:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs/ABC*/files/ABC*.xlsm'
sheet_name = 'export'
sheet_range = 'A1:Z2'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .load(sharepoint_site_url)
)&lt;/LI-CODE&gt;&lt;P&gt;The path uses two wildcard levels: first to match a set of docs directories (ABC*), and then to match specific .xlsm files within a fixed subdirectory (files/ABC*.xlsm).&lt;/P&gt;&lt;P&gt;Our SharePoint site has around 5,000 directories, the vast majority of which contain no matching files. In our current dev environment there are only 10 files that satisfy the wildcard criteria, yet the load is consistently taking 40+ minutes to return them.&lt;/P&gt;&lt;P&gt;My assumption is that the connector is scanning all the files in my SharePoint, effectively making thousands of calls before a single file is read. Is that correct?&lt;/P&gt;&lt;P&gt;Has anyone found a way to speed this up? Are there any connector options, configuration settings, or recommended patterns for this kind of wildcard-heavy path? Any guidance would be appreciated.&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jun 2026 21:15:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160011#M54845</guid>
      <dc:creator>ConnorK</dc:creator>
      <dc:date>2026-06-21T21:15:30Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Standard SharePoint Connector Performance Issues</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160019#M54846</link>
      <description>&lt;P&gt;Yes, I think the delay is likely coming from file discovery rather than from reading the Excel files.&lt;BR /&gt;&lt;BR /&gt;Even if only 10 files match in dev, Databricks still has to find them first. With "docs/ABC*/files/ABC*.xlsm", it can end up scanning a large part of the SharePoint site before it reaches those 10 files. You can test this by pointing ".load()" at one known folder containing one known file. If that comes back fast, then the issue is almost certainly the wildcard discovery.&lt;/P&gt;&lt;P&gt;Avoid the multilevel wildcard if possible. Either point to a smaller fixed folder and use pathGlobFilter, or keep a small manifest of exact file URLs. If this runs regularly, it is better to stage the files to cloud storage or a UC Volume first and read from there, instead of making SharePoint do the discovery every time.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 03:23:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160019#M54846</guid>
      <dc:creator>bala_sai</dc:creator>
      <dc:date>2026-06-22T03:23:32Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Standard SharePoint Connector Performance Issues</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160030#M54849</link>
      <description>&lt;P&gt;The standard SharePoint connector doesn't handle folder path-based filtering well. When you use wildcards in the path itself (&lt;SPAN class=""&gt;ABC*/files/ABC*.xlsm&lt;/SPAN&gt;), the connector has to enumerate directories at the SharePoint level to resolve the patterns, which means many API calls across the ~5,000 directories.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;You can use &lt;STRONG&gt;&lt;SPAN class=""&gt;pathGlobFilter&lt;/SPAN&gt;&lt;/STRONG&gt; instead of path wildcards:&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;lakeflow_connection_name = 'sharepoint_dev'
sharepoint_site_url = 'https://example.sharepoint.com/sites/example_site/docs'
sheet_name = 'export'   # same sheet and range as in the original question
sheet_range = 'A1:Z2'

excel_df = (spark.read
    .format("excel")
    .option("databricks.connection", lakeflow_connection_name)
    .option("headerRows", 1)
    .option("inferSchema", False)
    .option("dataAddress", f"{sheet_name}!{sheet_range}")
    .option("pathGlobFilter", "ABC*.xlsm")  # Filter here; matched against the file name only, not the folder path
    .load(sharepoint_site_url)
)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN class=""&gt;pathGlobFilter&lt;/SPAN&gt; filters files by name &lt;STRONG&gt;after&lt;/STRONG&gt; the connector retrieves the file list and is generally more efficient than path-level wildcards.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Be more specific with paths:&lt;/STRONG&gt; if you know the specific ABC directory names, read each one explicitly in a separate read and union the results.&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;from functools import reduce

target_dirs = ['ABC001', 'ABC002', 'ABC003']  # known directory names
dfs = []
for dir_name in target_dirs:
    path = f'https://example.sharepoint.com/sites/example_site/docs/{dir_name}/files/{dir_name}*.xlsm'
    df = (spark.read.format("excel")
        .option("databricks.connection", lakeflow_connection_name)
        .option("headerRows", 1).option("inferSchema", False)
        .option("dataAddress", f"{sheet_name}!{sheet_range}")
        .load(path))
    dfs.append(df)

excel_df = reduce(lambda a, b: a.unionByName(b), dfs)  # union the per-directory reads&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 22 Jun 2026 05:18:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160030#M54849</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-06-22T05:18:21Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Standard SharePoint Connector Performance Issues</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160038#M54850</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I think your diagnosis is likely correct.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;One thing that stands out is that you’re only reading &lt;/SPAN&gt;&lt;SPAN&gt;A1:Z2&lt;/SPAN&gt;&lt;SPAN&gt; from each workbook. Given that the operation is still taking 40+ minutes, the bottleneck is unlikely to be the Excel parsing itself and more likely to be file discovery.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;With ~5,000 directories and a multilevel wildcard (&lt;/SPAN&gt;&lt;SPAN&gt;ABC*/files/ABC*.xlsm&lt;/SPAN&gt;&lt;SPAN&gt;), the connector may be spending most of its time resolving the matching paths before it ever starts reading data.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I’d also be cautious about relying on &lt;/SPAN&gt;&lt;SPAN&gt;pathGlobFilter&lt;/SPAN&gt;&lt;SPAN&gt; here. Even if it helps narrow file selection, the expensive part appears to be discovering the files in the first place.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;As a quick validation, I’d try reading a few known paths explicitly and compare the runtime. If that drops significantly, then wildcard resolution is likely the dominant cost, and a manifest-driven or staged ingestion pattern may be a better long-term approach.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 07:24:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-standard-sharepoint-connector-performance-issues/m-p/160038#M54850</guid>
      <dc:creator>Yogasathyandrun</dc:creator>
      <dc:date>2026-06-22T07:24:24Z</dc:date>
    </item>
  </channel>
</rss>

