Databricks UMF Best Practice
03-12-2025 07:17 AM
Hi there, I would like to get some feedback on the ideal/suggested ways to get UMF data from our Azure cloud into Databricks. For context, UMF can mean either:
- User Managed File
- User Maintained File
Basically, a UMF could be something like a simple CSV that we know may change over time based on the latest file that the user uploads.
One method I'm exploring is using the Microsoft Graph API to pull user-generated content wherever it may be (e.g. OneDrive, SharePoint, Teams, etc.). However, before I proceed with the Microsoft Graph API, I'd like to check whether others in this community have found better/standard ways to pull UMF data into Databricks.
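For reference, here is roughly what I have in mind with Graph (a rough sketch only; the app registration, tenant ID, site ID, and file path below are placeholders, not real values):
```python
import msal
import requests

# All IDs and paths below are placeholders for illustration only.
app = msal.ConfidentialClientApplication(
    client_id="<app_registration_id>",
    authority="https://login.microsoftonline.com/<tenant_id>",
    client_credential="<client_secret>",
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

# Download a CSV from a SharePoint document library via Graph
url = ("https://graph.microsoft.com/v1.0/sites/<site_id>"
       "/drive/root:/UMF/latest.csv:/content")
resp = requests.get(url, headers={"Authorization": f"Bearer {token['access_token']}"})
resp.raise_for_status()

with open("/tmp/umf_latest.csv", "wb") as f:
    f.write(resp.content)
```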
03-17-2025 09:14 AM
Hi there, checking back in here. Can someone provide some feedback on my post?
+ @NandiniN / @raphaelblg
3 weeks ago
Hello Mate,
I tried a similar approach in my workspace. I'm not sure how much it will help you, but I'm sharing it anyway:
There is a Google spreadsheet maintained by our GTM/Sales team that changes constantly, with many fields updated daily, and they want analytics and a few metrics on this data. Since their first deadline is an early-morning report, I proposed refreshing my job about 30 minutes before that, reading the sheet through the Google Sheets API; based on this, we refresh every 12 hours. Feel free to ignore this if it doesn't help.
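Roughly, the read side looks like the sketch below (the service-account file, sheet key, and worksheet name are placeholders):
```python
import gspread

# Placeholder service-account file, sheet key, and worksheet name
gc = gspread.service_account(filename="/dbfs/FileStore/creds/sheets_sa.json")
ws = gc.open_by_key("<spreadsheet_id>").worksheet("Sheet1")
records = ws.get_all_records()  # one dict per row, keyed by the header row

# Land the snapshot as a Spark DataFrame for the downstream metrics
df = spark.createDataFrame(records)
```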
Thanks for the question.
Saran
2 weeks ago
- Azure Data Factory (ADF):
- ADF can be used to ingest files from Azure services directly into Azure Data Lake Storage (ADLS), from which Databricks can pick up UMF data. ADF supports connectors for extracting from sources like SharePoint or OneDrive and can be combined with Databricks for further processing.
- For such UMF ingestion, setting up data pipelines or configuring triggers within ADF can automate the ETL process. ADF's OData and SharePoint connectors may also help when the source files change over time.
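- For illustration, kicking off such an ADF pipeline on demand from Python might look like this (a rough sketch; the subscription, resource group, factory, and pipeline names are placeholders, and the pipeline itself is assumed to copy the UMF file into ADLS):
```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder subscription, resource group, factory, and pipeline names
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription_id>")
run = adf_client.pipelines.create_run(
    resource_group_name="<resource_group>",
    factory_name="<data_factory>",
    pipeline_name="ingest_umf_to_adls",  # hypothetical pipeline that lands the file in ADLS
    parameters={"source_path": "UMF/latest.csv"},
)
print(run.run_id)  # poll adf_client.pipeline_runs.get(...) with this ID if needed
```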
- Databricks Auto Loader:
- Auto Loader can be an effective tool to ingest dynamically changing UMF files into Delta Lake. It supports continuous file monitoring in ADLS or Blob storage and is highly scalable for workflows where users upload new files regularly.
- For example:
```python
# Auto Loader stream over the UMF landing path (paths are placeholders)
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "<schema_tracking_path>")  # required when inferring the schema
      .load("path_to_azure_storage"))
```
- Unity Catalog with External Tables or Volumes:
- Depending on how the files are stored, external locations or volumes registered in Unity Catalog can manage access and governance for UMF data. You might also consider using Volumes to provide a space for handling raw UMF data before processing.
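- For example, once a Volume is registered, reading the latest UMF drop might look like this (a minimal sketch; the catalog, schema, volume, and table names are placeholders):
```python
# Hypothetical Unity Catalog volume path for raw UMF uploads
umf_path = "/Volumes/main/raw_umf/uploads/latest.csv"

df = (spark.read
      .option("header", True)
      .csv(umf_path))

# Persist as a governed Delta table for downstream processing
df.write.mode("overwrite").saveAsTable("main.curated.umf_latest")
```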
- Integration Using Python Libraries:
- Databricks supports accessing Azure storage through standard libraries like the Azure SDK for Python. This facilitates scripting the download/upload of dynamic files (e.g., CSV) into Databricks workspaces.
- Example Python code to fetch files:
```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient(
    account_url="https://<storage_acct>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container_client = blob_service_client.get_container_client("container_name")

blob_list = container_client.list_blobs()
for blob in blob_list:
    # Logic for file filtering based on modification
    download_stream = container_client.download_blob(blob.name)
    with open(blob.name, "wb") as file:
        file.write(download_stream.readall())
```

