
Databricks UMF Best Practice

ChristianRRL
Valued Contributor III

Hi there, I would like some feedback on the ideal/suggested ways to get UMF data from our Azure cloud into Databricks. For context, UMF can mean either:

  • User Managed File
  • User Maintained File

Basically, a UMF could be something like a simple CSV that we know may change over time based on the latest file that the user uploads.

One method I'm exploring is using the Microsoft Graph API to pull user-generated content wherever it may be (e.g. OneDrive, SharePoint, Teams, etc.). However, before I proceed with the Microsoft Graph API, I'd like to check whether others in this community have found better/standard ways to pull UMF data into Databricks.

1 REPLY

mark_ott
Databricks Employee

Several effective patterns exist for ingesting User Managed Files (UMF) such as CSVs from Azure into Databricks, each with different trade-offs depending on governance, user interface preferences, and integration with Microsoft 365 services.

Common Approaches

  • Direct Cloud Storage Integration: Many teams set up a dedicated Azure Blob Storage account (or Data Lake) as a landing zone for user uploads. Users place or update their UMFs in this storage, which is then registered as an external location or Unity Catalog volume in Databricks. Databricks Auto Loader or Databricks notebooks can then efficiently pick up new or changed files for ingestion and transformation (a minimal Auto Loader sketch follows this list). This method can be managed with RBAC and integrates well with existing cloud security and orchestration patterns.

  • Azure Data Factory (ADF): ADF offers a visual, scheduled way to copy files from a wide range of sources (SharePoint, OneDrive, etc.) into Azure storage or directly into Databricks. It handles authentication for Microsoft sources and provides built-in connectors, making it a strong option for low-code automated pipelines. Power Automate offers similar simplicity for lightweight use cases with strong Microsoft 365 integration.

  • Microsoft Graph API: Useful when files reside in locations such as SharePoint, OneDrive, or Teams, and when automation must reach user-scoped file collections dynamically. You can use Databricks (via dbutils, requests, or custom connectors) to fetch file content from the Graph API, write it to DBFS or a Unity Catalog volume, and then process it as needed (a Graph API download sketch also follows this list). This approach is highly flexible but requires additional token management and error handling; community feedback often views it as the most programmatically agile and direct solution. Pipedream and other workflow tools can help automate Microsoft Graph-to-Databricks integrations as well.

  • Microsoft Graph Data Connect: For larger, organization-wide extractions from Microsoft 365, Graph Data Connect can bulk-export data into your Azure storage for subsequent processing in Databricks. It is typically more relevant for enterprise-scale scenarios.
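
Below is a minimal Auto Loader sketch for the landing-zone pattern above. It assumes the landing zone is exposed as a Unity Catalog volume at /Volumes/main/umf/landing and that the target table main.umf.bronze_umf is acceptable; all catalog, schema, path, and table names are hypothetical placeholders, not a prescribed layout.

```python
# Minimal Auto Loader sketch (hypothetical volume, path, and table names).
# Incrementally picks up new or changed CSVs that users drop into the landing volume.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/umf/_schemas/umf_csv")
    .load("/Volumes/main/umf/landing/")
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/umf/_checkpoints/umf_csv")
    .trigger(availableNow=True)  # run as a batch-style job; remove for continuous streaming
    .toTable("main.umf.bronze_umf")
)
```

If users overwrite the same file rather than uploading new names, consider the cloudFiles.allowOverwrites option or a versioned naming convention so re-uploads are picked up again.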

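Here is a similarly hedged sketch of the Graph API route: acquire an app-only token, download one file, and land it in the same volume. It assumes an Azure AD (Entra ID) app registration with the Files.Read.All application permission, the msal and requests libraries (msal may need %pip install msal), and a Databricks secret scope for the client secret; the tenant, client, drive, and item IDs are hypothetical placeholders.

```python
# Hedged sketch: pull one user-managed file from SharePoint/OneDrive via Microsoft Graph
# and land it in a Unity Catalog volume. All IDs and names below are placeholders.
import requests
from msal import ConfidentialClientApplication  # may require: %pip install msal

tenant_id = "<tenant-id>"
client_id = "<app-registration-client-id>"
client_secret = dbutils.secrets.get(scope="umf", key="graph-client-secret")  # hypothetical secret scope

app = ConfidentialClientApplication(
    client_id,
    authority=f"https://login.microsoftonline.com/{tenant_id}",
    client_credential=client_secret,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

# Download the file's binary content from the drive that holds it.
drive_id = "<sharepoint-or-onedrive-drive-id>"
item_id = "<file-item-id>"
resp = requests.get(
    f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{item_id}/content",
    headers={"Authorization": f"Bearer {token['access_token']}"},
    timeout=60,
)
resp.raise_for_status()

# Land the raw bytes in the volume so a downstream job (e.g. the Auto Loader stream above) can ingest them.
with open("/Volumes/main/umf/landing/user_file.csv", "wb") as f:
    f.write(resp.content)
```
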
Best Practices

  • Use Unity Catalog Volumes: For governance and ease of access, volumes provide POSIX-style file paths in Databricks and are recommended for user-uploaded files. Unity Catalog also manages permissions centrally (a volume setup sketch follows this list).

  • Azure Authentication: Prefer Azure managed identities over hardcoded service credentials or manually refreshed tokens to securely access storage and Microsoft APIs.

  • Automation & Scaling: Where users frequently update files, employ Databricks Auto Loader for incremental, event-based ingestion, and use orchestration (ADF, Azure Functions, Prefect/Airflow) for scheduled or event-driven jobs.
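
As a rough illustration of the volume recommendation, the sketch below registers an ADLS container as an external location and volume. It assumes a storage credential named umf_mi_credential already exists and is backed by an Azure managed identity (Databricks access connector); the storage account, catalog, and schema names are placeholders.

```python
# Rough sketch: expose a managed-identity-governed ADLS path as a Unity Catalog volume.
# `umf_mi_credential`, the storage account, and catalog/schema names are all hypothetical.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS umf_landing
    URL 'abfss://umf@<storage-account>.dfs.core.windows.net/landing'
    WITH (STORAGE CREDENTIAL umf_mi_credential)
""")

spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS main.umf.landing
    LOCATION 'abfss://umf@<storage-account>.dfs.core.windows.net/landing'
""")

# User-uploaded files now appear under a governed, POSIX-style path.
display(dbutils.fs.ls("/Volumes/main/umf/landing"))
```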

Community/Industry Input

  • Many data engineering teams use ADF or manual cloud file drops due to their reliability, auditability, and tight integration with Azure security.

  • Direct Microsoft Graph API use is valued for flexibility and real-time access to user-specific content, especially when file location or ownership is dynamic, but it does involve more up-front engineering and ongoing authentication management.

Summary Table

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Azure Blob/Data Lake + Volumes | Simple, secure, scalable, well-governed | Users need to upload manually or via UI | Standardized file ingestion |
| ADF/Power Automate | No/low code, MSFT connectors, scheduling | May not be flexible for edge cases | Scheduled/business process flows |
| Microsoft Graph API | Real-time, flexible, user-scoped | More dev effort, token management | Dynamic/user-driven file pulls |
| Microsoft Graph Data Connect | Enterprise-scale, bulk data export | Not for small ad-hoc files | Org-wide Office 365 data sync |

For most corporate users, cloud storage plus Unity Catalog volumes, or ADF, is the fastest route, with the Microsoft Graph API reserved for advanced/dynamic scenarios.
