
Databricks UMF Best Practice

ChristianRRL
Valued Contributor III

Hi there, I would like some feedback on the ideal/suggested ways to get UMF data from our Azure cloud into Databricks. For context, UMF can mean either:

  • User Managed File
  • User Maintained File

Basically, a UMF could be something like a simple CSV that we know may change over time based on the latest file that the user uploads.

One method I'm exploring is using the Microsoft Graph API to pull user-generated content wherever it may be (e.g. OneDrive, SharePoint, Teams, etc.). However, before I proceed with the Microsoft Graph API, I'd like to check whether others in this community have found better/standard ways to pull UMF data into Databricks.

1 REPLY

mark_ott
Databricks Employee

Several effective patterns exist for ingesting User Managed Files (UMF) such as CSVs from Azure into Databricks, each with different trade-offs depending on governance, user interface preferences, and integration with Microsoft 365 services.

Common Approaches

  • Direct Cloud Storage Integration: Many teams set up a dedicated Azure Blob Storage account (or Data Lake) as a landing zone for user uploads. Users place or update their UMFs in this storage, which is then registered as an external location or Unity Catalog volume in Databricks. Databricks Auto Loader or Databricks notebooks can then efficiently pick up new or changed files for ingestion and transformation (a minimal Auto Loader sketch follows this list). This method can be managed with RBAC and integrates well with existing cloud security and orchestration patterns.

  • Azure Data Factory (ADF): ADF offers a visual, scheduled way to copy files from a wide range of sources (SharePoint, OneDrive, etc.) into Azure storage or directly into Databricks. It handles authentication for Microsoft sources and provides built-in connectors, making it a strong option for low-code automated pipelines. Power Automate offers similar simplicity for lightweight use cases with strong Microsoft 365 integration.

  • Microsoft Graph API: Useful when files reside in locations such as SharePoint, OneDrive, or Teams, and when automation must reach user-scoped file collections dynamically. You can use Databricks (via dbutils, requests, or custom connectors) to fetch file content from the Graph API, write it to DBFS or a Unity Catalog volume, and then process it as needed (a Graph API download sketch also follows this list). This approach is highly flexible but requires additional token management and error handling; community feedback often views it as the most programmatically agile and direct solution. Pipedream and other workflow tools can help automate Microsoft Graph-to-Databricks integrations as well.

  • Microsoft Graph Data Connect: For larger, organization-wide extractions from Microsoft 365, Graph Data Connect can bulk-export data into your Azure storage for subsequent processing in Databricks. It is typically more relevant for enterprise-scale scenarios.
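
Below is a minimal Auto Loader sketch for the landing-zone pattern above. It assumes the landing zone is exposed as a Unity Catalog volume at /Volumes/main/umf/landing and that the target table main.umf.bronze_umf is acceptable; all catalog, schema, path, and table names are hypothetical placeholders, not a prescribed layout.

```python
# Minimal Auto Loader sketch (hypothetical volume, path, and table names).
# Incrementally picks up new or changed CSVs that users drop into the landing volume.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.schemaLocation", "/Volumes/main/umf/_schemas/umf_csv")
    .load("/Volumes/main/umf/landing/")
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/umf/_checkpoints/umf_csv")
    .trigger(availableNow=True)  # run as a batch-style job; remove for continuous streaming
    .toTable("main.umf.bronze_umf")
)
```

If users overwrite the same file rather than uploading new names, consider the cloudFiles.allowOverwrites option or a versioned naming convention so re-uploads are picked up again.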

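Here is a similarly hedged sketch of the Graph API route: acquire an app-only token, download one file, and land it in the same volume. It assumes an Azure AD (Entra ID) app registration with the Files.Read.All application permission, the msal and requests libraries (msal may need %pip install msal), and a Databricks secret scope for the client secret; the tenant, client, drive, and item IDs are hypothetical placeholders.

```python
# Hedged sketch: pull one user-managed file from SharePoint/OneDrive via Microsoft Graph
# and land it in a Unity Catalog volume. All IDs and names below are placeholders.
import requests
from msal import ConfidentialClientApplication  # may require: %pip install msal

tenant_id = "<tenant-id>"
client_id = "<app-registration-client-id>"
client_secret = dbutils.secrets.get(scope="umf", key="graph-client-secret")  # hypothetical secret scope

app = ConfidentialClientApplication(
    client_id,
    authority=f"https://login.microsoftonline.com/{tenant_id}",
    client_credential=client_secret,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

# Download the file's binary content from the drive that holds it.
drive_id = "<sharepoint-or-onedrive-drive-id>"
item_id = "<file-item-id>"
resp = requests.get(
    f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{item_id}/content",
    headers={"Authorization": f"Bearer {token['access_token']}"},
    timeout=60,
)
resp.raise_for_status()

# Land the raw bytes in the volume so a downstream job (e.g. the Auto Loader stream above) can ingest them.
with open("/Volumes/main/umf/landing/user_file.csv", "wb") as f:
    f.write(resp.content)
```
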
Best Practices

  • Use Unity Catalog Volumes: For governance and ease of access, volumes provide POSIX-style file paths in Databricks and are recommended for user-uploaded files. Unity Catalog also manages permissions centrally (a volume setup sketch follows this list).

  • Azure Authentication: Prefer Azure managed identities over hardcoded service credentials or manually refreshed tokens to securely access storage and Microsoft APIs.

  • Automation & Scaling: Where users frequently update files, employ Databricks Auto Loader for incremental, event-based ingestion, and use orchestration (ADF, Azure Functions, Prefect/Airflow) for scheduled or event-driven jobs.
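
As a rough illustration of the volume recommendation, the sketch below registers an ADLS container as an external location and volume. It assumes a storage credential named umf_mi_credential already exists and is backed by an Azure managed identity (Databricks access connector); the storage account, catalog, and schema names are placeholders.

```python
# Rough sketch: expose a managed-identity-governed ADLS path as a Unity Catalog volume.
# `umf_mi_credential`, the storage account, and catalog/schema names are all hypothetical.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS umf_landing
    URL 'abfss://umf@<storage-account>.dfs.core.windows.net/landing'
    WITH (STORAGE CREDENTIAL umf_mi_credential)
""")

spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS main.umf.landing
    LOCATION 'abfss://umf@<storage-account>.dfs.core.windows.net/landing'
""")

# User-uploaded files now appear under a governed, POSIX-style path.
display(dbutils.fs.ls("/Volumes/main/umf/landing"))
```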

Community/Industry Input

  • Many data engineering teams use ADF or manual cloud file drops due to their reliability, auditability, and tight integration with Azure security.

  • Direct Microsoft Graph API use is valued for flexibility and real-time access to user-specific content, especially when file location or ownership is dynamic, but it does involve more up-front engineering and ongoing authentication management.

Summary Table

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Azure Blob/Data Lake + Volumes | Simple, secure, scalable, well-governed | Users need to upload manually or via UI | Standardized file ingestion |
| ADF/Power Automate | No/low code, MSFT connectors, scheduling | May not be flexible for edge cases | Scheduled/business process flows |
| Microsoft Graph API | Real-time, flexible, user-scoped | More dev effort, token management | Dynamic/user-driven file pulls |
| Microsoft Graph Data Connect | Enterprise-scale, bulk data export | Not for small ad-hoc files | Org-wide Office 365 data sync |

For most corporate users, cloud storage plus Unity Catalog volumes, or ADF, is the fastest route, with the Microsoft Graph API reserved for advanced/dynamic scenarios.
