
Databricks UMF Best Practice

ChristianRRL
Valued Contributor

Hi there, I would like to get some feedback on the ideal/suggested ways to get UMF data from our Azure cloud into Databricks. For context, UMF can mean either:

  • User Managed File
  • User Maintained File

Basically, a UMF could be something like a simple CSV that we know may change over time based on the latest file that the user uploads.

One method I'm exploring is using the Microsoft Graph API to pull user-generated content wherever it may be (e.g. OneDrive, SharePoint, Teams, etc.). However, before I proceed with the Graph API, I'd like to check whether others in this community have found better/standard ways to pull UMF data into Databricks.
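For reference, here's a rough sketch of the kind of Graph API pull I have in mind, assuming an Azure AD app registration with application-level Files.Read.All permission and the msal/requests libraries. The tenant/client IDs, site ID, and file path are placeholders, not a working setup:

```python
# Rough sketch (not production code): download a user-maintained CSV from SharePoint
# via Microsoft Graph. All IDs, secrets, and paths below are placeholders.
import msal
import requests

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
headers = {"Authorization": f"Bearer {token['access_token']}"}

# Download a known file from a SharePoint site's default document library by path.
url = (
    "https://graph.microsoft.com/v1.0/sites/<site-id>"
    "/drive/root:/umf/latest.csv:/content"
)
resp = requests.get(url, headers=headers)
resp.raise_for_status()

# Land the raw bytes somewhere Databricks can read them (e.g. cloud storage or a
# Unity Catalog volume); /tmp is just for illustration.
with open("/tmp/umf_latest.csv", "wb") as f:
    f.write(resp.content)
```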

3 REPLIES

ChristianRRL
Valued Contributor

Hi there, checking back in here. Could someone provide some feedback on my post?

@NandiniN / @raphaelblg 

saisaran_g
Contributor

Hello Mate,

I tried a similar approach in my workspace. I'm not sure how much it will help you, but I'm sharing it here:

There is a Google spreadsheet maintained by the GTM/Sales team that changes constantly, with many fields updated daily, and they would like to see analytics and a few metrics on this data. The solution I proposed works off their earliest reporting deadline: since they need a data report early in the morning, I refresh my job about 30 minutes before that by reading the sheet through the Google Sheets API, and on that basis we refresh every 12 hours. Feel free to ignore this if it doesn't help.
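For what it's worth, a minimal sketch of that kind of scheduled pull, assuming a Google service account with read access to the sheet and the gspread library; the sheet name, key path, and target table are placeholders, not what we actually use:

```python
# Minimal sketch: read a Google Sheet on a scheduled Databricks job and snapshot it to a table.
# Assumes a Google service account JSON key with read access to the sheet; names are placeholders.
import gspread
import pandas as pd

gc = gspread.service_account(filename="/dbfs/FileStore/keys/sheets-service-account.json")
worksheet = gc.open("GTM Sales Tracker").sheet1

# get_all_records() returns one dict per row, using the header row as column names.
rows = worksheet.get_all_records()

# spark is available by default in Databricks notebooks and jobs.
df = spark.createDataFrame(pd.DataFrame.from_records(rows))
df.write.mode("overwrite").saveAsTable("analytics.gtm_sales_snapshot")
```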

Thank you for the question.

Happy learning, and keep solving new errors!
Saran

BigRoux
Databricks Employee
I am not an expert on this topic or Azure services, but I did some research and have some suggested courses of action for you to test out. To address your request for suggested ways to get User Managed Files (UMF) from Azure into Databricks, here are some key approaches and perspectives based on the context and search results:
 
Suggested Approaches for Ingesting UMF Data:
  1. Microsoft Graph API:
    • Using the Microsoft Graph API to retrieve user-generated content, such as files from OneDrive, SharePoint, and Teams, is a viable option.
    • Challenges specific to API usage in serverless environments were noted in Slack discussions, including cases where API calls from within Databricks mapInPandas functions resulted in tasks stalling. For example, the discussion highlighted that serverless clusters may have restricted outbound internet connectivity by default. You can reference the details mentioned in the Slack threads for more context (link).
  2. Azure Data Factory (ADF):
    • ADF can be used to ingest files from Azure services directly into Azure Data Lake Storage (ADLS), from which Databricks can pick up UMF data. ADF supports connectors for extracting from sources like SharePoint or OneDrive and can be combined with Databricks for further processing.
    • For such UMF ingestion, setting up data pipelines or configuring triggers within ADF can automate the ETL process. ADF’s OData connectors might support tasks involving changing files.
  3. Databricks Auto Loader:
    • Auto Loader can be an effective tool to ingest dynamically changing UMF files into Delta Lake. It supports continuous file monitoring in ADLS or Blob storage and is highly scalable for workflows where users upload new files regularly.
    • For example:
```python
# Incrementally load new CSV files as they arrive in the monitored storage path.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .load("path_to_azure_storage")
)
```
  4. Unity Catalog with External Tables or Volumes:
    • Depending on how the files are stored, external locations or volumes registered in Unity Catalog can manage access and governance for UMF data. You might also consider using Volumes to provide a space for handling raw UMF data before processing (see the sketch after this list).
  5. Integration Using Python Libraries:
    • Databricks supports accessing Azure storage layers through standard libraries like Azure SDK. This facilitates scripting for downloading/uploading dynamic files (e.g., CSV) into Databricks workspaces.
    • Example Python code to fetch files:
```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient(
    account_url="https://<storage_acct>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container_client = blob_service_client.get_container_client("container_name")

blob_list = container_client.list_blobs()
for blob in blob_list:
    # Logic for file filtering based on modification
    download_stream = container_client.download_blob(blob.name)
    with open(blob.name, "wb") as file:
        file.write(download_stream.readall())
```
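To make option 4 concrete, here is a minimal sketch, assuming an external location has already been created over the ADLS path where users drop their files; the catalog, schema, volume, and storage account names are placeholders:

```python
# Hypothetical sketch: expose the ADLS folder where UMF files land as a Unity Catalog
# external volume, then read the raw CSVs through the governed volume path.
# Assumes an external location already covers this storage path; all names are placeholders.
spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS raw.umf.user_files
    LOCATION 'abfss://umf@<storage_acct>.dfs.core.windows.net/incoming'
""")

df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("/Volumes/raw/umf/user_files/")
)
```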
Recommendations:
  • Best Method Depends on Use Case: If your UMF data resides in OneDrive or SharePoint, leveraging the Graph API might be one of the better options, provided that any potential bottlenecks (e.g., serverless tasks) are resolved. For ADLS or Blob storage, Auto Loader and Databricks-native integration tools offer streamlined solutions.
  • Consider Governance, Scalability & Security: Ensure clear access policies for sensitive modifications and utilize Azure features like Private Link or ADLS Gen2-specific mechanisms.
  • Continuous Improvement: If initial tests indicate performance or reliability issues, explore tools such as Azure Data Factory or third-party solutions like Qlik Replicate, which has demonstrated strong integration capabilities with Databricks and Azure ecosystems.