<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Databricks UMF Best Practice in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/databricks-umf-best-practice/m-p/113418#M4892</link>
    <description>&lt;P&gt;Hi there, I would like to get some feedback on what are the ideal/suggested ways to get UMF data from our Azure cloud into Databricks. For context, UMF can mean either:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;User Managed File&lt;/LI&gt;&lt;LI&gt;User Maintained File&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Basically, a UMF could be something like a simple CSV that we know may change over time based on the latest file that the user uploads.&lt;/P&gt;&lt;P&gt;One method I'm exploring is using the Microsoft Graph API in order to pull user generated content wherever it may be (e.g. OneDrive, SharePoint, Teams, etc). However, before I proceed with using the Microsoft Graph API, I'd like to check if others in this community have found better/standard ways to pull in UMF data into Databricks.&lt;/P&gt;</description>
    <pubDate>Mon, 24 Mar 2025 13:49:38 GMT</pubDate>
    <dc:creator>ChristianRRL</dc:creator>
    <dc:date>2025-03-24T13:49:38Z</dc:date>
    <item>
      <title>Databricks UMF Best Practice</title>
      <link>https://community.databricks.com/t5/get-started-discussions/databricks-umf-best-practice/m-p/113418#M4892</link>
      <description>&lt;P&gt;Hi there, I would like to get some feedback on what are the ideal/suggested ways to get UMF data from our Azure cloud into Databricks. For context, UMF can mean either:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;User Managed File&lt;/LI&gt;&lt;LI&gt;User Maintained File&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Basically, a UMF could be something like a simple CSV that we know may change over time based on the latest file that the user uploads.&lt;/P&gt;&lt;P&gt;One method I'm exploring is using the Microsoft Graph API in order to pull user generated content wherever it may be (e.g. OneDrive, SharePoint, Teams, etc). However, before I proceed with using the Microsoft Graph API, I'd like to check if others in this community have found better/standard ways to pull in UMF data into Databricks.&lt;/P&gt;</description>
      <pubDate>Mon, 24 Mar 2025 13:49:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/databricks-umf-best-practice/m-p/113418#M4892</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2025-03-24T13:49:38Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks UMF Best Practice</title>
      <link>https://community.databricks.com/t5/get-started-discussions/databricks-umf-best-practice/m-p/139233#M11028</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Several effective patterns exist for ingesting User Managed Files (UMF) such as CSVs from Azure into Databricks, each with different trade-offs depending on governance, user interface preferences, and integration with Microsoft 365 services.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Common Approaches&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Direct Cloud Storage Integration:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Many teams set up a dedicated Azure Blob Storage account (or Data Lake) as a landing zone for user uploads. Users place or update their UMFs in this storage, which is then registered as an external location or Unity Catalog volume in Databricks. Databricks Auto Loader or Databricks notebooks can efficiently pick up new or changed files for ingestion and transformation. This method can be managed with RBAC and integrates well with existing cloud security and orchestration patterns.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Azure Data Factory (ADF):&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;ADF offers a visual, scheduled way to copy files from a wide range of sources (SharePoint, OneDrive, etc.) into Azure storage or directly into Databricks. It handles authentication for Microsoft sources and provides built-in connectors, making it a strong option for low-code automated pipelines. Power Automate offers similar simplicity for lightweight use cases with strong Microsoft 365 integration.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Microsoft Graph API:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Useful when files reside in locations like SharePoint, OneDrive, or Teams, and when automation must reach user-scoped file collections dynamically. You can use Databricks (via dbutils, requests, or custom connectors) to fetch file binary content from Graph API, store them to DBFS/volumes, then process as needed. This approach is highly flexible but requires additional token management and error handling, though some community feedback views it as the most programmatically agile and direct solution. Pipedream and other workflow tools can help automate Microsoft Graph-to-Databricks integrations as well.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Microsoft Graph Data Connect:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;For larger, organization-wide extractions from Microsoft 365, Graph Data Connect can bulk export data into your Azure storage for subsequent processing in Databricks. Typically more relevant for enterprise-scale scenarios.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
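&lt;P&gt;As a rough sketch of the first approach: the snippet below assumes it runs inside a Databricks notebook (where &lt;CODE&gt;spark&lt;/CODE&gt; is predefined) and uses hypothetical volume, checkpoint, and table names; treat it as an illustration of the Auto Loader pattern, not a drop-in pipeline.&lt;/P&gt;

```python
# Hypothetical sketch: incremental CSV ingestion from a Unity Catalog volume
# using Auto Loader (the "cloudFiles" source). All paths and the target table
# name are illustrative placeholders, not real resources.

UMF_VOLUME = "/Volumes/main/umf/landing"       # hypothetical landing volume
CHECKPOINT = "/Volumes/main/umf/_checkpoints"  # hypothetical checkpoint path

def umf_stream(spark, source=UMF_VOLUME, checkpoint=CHECKPOINT):
    """Start a streaming query that picks up new or changed UMF CSVs.

    `spark` is the SparkSession a Databricks notebook provides; Auto Loader
    tracks which files it has seen via the checkpoint/schema location, so
    re-uploaded files with new names are ingested incrementally.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", checkpoint)
        .option("header", "true")
        .load(source)
        .writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)   # process everything new, then stop
        .toTable("main.umf.bronze_umf")
    )
```

&lt;P&gt;The &lt;CODE&gt;availableNow&lt;/CODE&gt; trigger makes this behave like a scheduled batch job while keeping Auto Loader's exactly-once file tracking.&lt;/P&gt;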
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Best Practices&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Use Unity Catalog Volumes:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;For governance and ease-of-access, volumes provide POSIX-style access in Databricks and are recommended for user-uploaded files. Unity Catalog also centrally manages permissions.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Azure Authentication:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Use Azure Managed Identities over hardcoding service credentials or manually refreshing tokens to securely access storage and Microsoft APIs.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Automation &amp;amp; Scaling:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;For scenarios where users frequently update files, employ Databricks’ Auto Loader for incremental and event-based ingestion. Use orchestration (ADF, Azure Functions, Prefect/Airflow) for scheduled or event-driven jobs.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Community/Industry Input&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Many data engineering teams use ADF or manual cloud file drops due to their reliability, auditability, and tight integration with Azure security.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Direct Microsoft Graph API use is valued for flexibility and real-time access to user-specific content, especially when file location or ownership is dynamic, but does involve more up-front engineering and ongoing authentication management.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
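&lt;P&gt;For the Graph API route, a minimal sketch of the request plumbing: the drive id and file path below are hypothetical, and in practice the bearer token would come from MSAL or a managed identity rather than a literal string.&lt;/P&gt;

```python
# Hypothetical sketch of Microsoft Graph's path-based addressing for a file's
# binary content: /drives/{drive-id}/root:/{item-path}:/content
import urllib.parse

GRAPH_BASE = "https://graph.microsoft.com/v1.0"  # Graph v1.0 endpoint

def drive_item_content_url(drive_id, item_path):
    """Build the URL for downloading a file's content from a drive."""
    quoted = urllib.parse.quote(item_path.strip("/"))  # keep "/" separators
    return f"{GRAPH_BASE}/drives/{drive_id}/root:/{quoted}:/content"

def auth_headers(token):
    """Bearer-token header; refresh the token before expiry in real use."""
    return {"Authorization": f"Bearer {token}"}

# Downloading to a volume would then look like (hypothetical ids/paths):
# resp = requests.get(drive_item_content_url("b!abc123", "reports/umf.csv"),
#                     headers=auth_headers(token))
# open("/Volumes/main/umf/landing/umf.csv", "wb").write(resp.content)
```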
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Summary Table&lt;/H2&gt;
&lt;DIV class="group relative"&gt;
&lt;DIV class="w-full overflow-x-auto md:max-w-[90vw] border-subtlest ring-subtlest divide-subtlest bg-transparent"&gt;
&lt;TABLE class="border-subtler my-[1em] w-full table-auto border-separate border-spacing-0 border-l border-t"&gt;
&lt;THEAD class="bg-subtler"&gt;
&lt;TR&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Approach&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Pros&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Cons&lt;/TH&gt;
&lt;TH class="border-subtler p-sm break-normal border-b border-r text-left align-top"&gt;Best For&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Azure Blob/Data Lake + Volumes&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Simple, secure, scalable, well-governed&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Users need to upload manually or via UI&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Standardized file ingestion&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;ADF/Power Automate&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;No/low code, MSFT connectors, scheduling&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;May not be flexible for edge cases&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Scheduled/business process flows&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Microsoft Graph API&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Real-time, flexible, user-scoped&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;More dev effort, token management&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Dynamic/user-driven file pulls&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Microsoft Graph Data Connect&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Enterprise-scale, bulk data export&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Not for small ad-hoc files&lt;/TD&gt;
&lt;TD class="px-sm border-subtler min-w-[48px] break-normal border-b border-r"&gt;Org-wide Office 365 data sync&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;/DIV&gt;
&lt;DIV class="bg-base border-subtler shadow-subtle pointer-coarse:opacity-100 right-xs absolute bottom-0 flex rounded-lg border opacity-0 transition-opacity group-hover:opacity-100 [&amp;amp;&amp;gt;*:not(:first-child)]:border-subtle [&amp;amp;&amp;gt;*:not(:first-child)]:border-l"&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="flex"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For most corporate users, cloud storage + Unity Catalog volumes or ADF are the fastest route, with Microsoft Graph API reserved for advanced/dynamic scenarios&lt;/P&gt;</description>
      <pubDate>Sun, 16 Nov 2025 17:21:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/databricks-umf-best-practice/m-p/139233#M11028</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-16T17:21:06Z</dc:date>
    </item>
  </channel>
</rss>

