Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Best practice: Using Databricks managed storage vs customer-owned ADLS for enterprise production data

LokeshChikuru
Databricks Partner

We are currently setting up Azure Databricks for enterprise analytics and wanted to validate our storage architecture against Databricks best practices.

Today, we are ingesting data directly from external enterprise sources (Oracle DB, SQL Server, etc.) using Databricks connectors, and the data is landing in the Databricks managed storage account (DBFS / ADLS in the managed resource group) created along with the workspace.

We do not currently have a customer-owned external ADLS Gen2 configured as a centralized enterprise data lake.

I understand from Databricks documentation that managed storage is primarily intended for workspace/internal use (logs, libraries, temp data, internal tables), and that for production and enterprise data, Databricks recommends using a customer-owned ADLS Gen2 accessed via Unity Catalog external locations.

My questions are:

  1. Is it recommended to store enterprise production data from external systems in Databricks managed storage (even when it is ADLS Gen2 with HNS enabled)?
  2. For enterprise-scale deployments with multiple workspaces and downstream consumers (ADF, Fabric), is a single customer-owned ADLS Gen2 the recommended system of record?
  3. Would Databricks consider managed storage an acceptable long-term data lake, or should it be treated strictly as workspace-scoped/internal storage?

Any clarification or confirmation from Databricks engineers or the community would be greatly appreciated.

1 ACCEPTED SOLUTION


emma_s
Databricks Employee

Hey, your research is correct. DBFS is for logs and internal Databricks workings, not for your production data. We would recommend having your own ADLS Gen2 storage container for all your production data. DBFS is available to all workspace users and has no governance over it. You would need to set up the ADLS storage container and then register it in Unity Catalog as managed storage.
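As a rough sketch of that registration step (all names here are placeholders, not from this thread), you would create an external location backed by a storage credential and then a catalog whose managed location points at the customer-owned container. The snippet below just assembles the Databricks SQL statements as strings; on a real workspace you would run each one with `spark.sql(...)`:

```python
# Hypothetical names throughout (account, container, credential) -- substitute
# your own. The storage credential itself is assumed to already exist and wrap
# an Azure access connector / managed identity.
STORAGE_ACCOUNT = "entdatalake"   # assumed customer-owned ADLS Gen2 account
CONTAINER = "prod"                # assumed container for production data
CREDENTIAL = "ent_storage_cred"   # assumed Unity Catalog storage credential

abfss_url = f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"

ddl = [
    # 1. Register the container as an external location.
    f"CREATE EXTERNAL LOCATION IF NOT EXISTS ent_prod_location "
    f"URL '{abfss_url}' WITH (STORAGE CREDENTIAL {CREDENTIAL})",
    # 2. Create a catalog whose managed tables physically live in that container.
    f"CREATE CATALOG IF NOT EXISTS ent_prod MANAGED LOCATION '{abfss_url}'",
]

for stmt in ddl:
    print(stmt)  # on a workspace: spark.sql(stmt)
```

With the catalog's managed location set this way, everything created under it lands in your own storage account rather than the workspace's managed resource group.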

You would then want managed tables storing the data in the ADLS Gen2 storage. It is important to note that these are not the same thing as managed storage. There is a choice between managed and external tables, but we tend to recommend managed tables, as they offer improved security and a lot of out-of-the-box optimisation. If you have other tools that need to access this data, they can do so via the API.
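To make the managed-vs-external distinction concrete (catalog, schema, and table names below are made up for illustration): a managed table in a catalog with a managed location needs no LOCATION clause, while an external table pins its own path. Again these are just the SQL strings; run them via `spark.sql(...)` on a workspace:

```python
# Hypothetical three-level names under an assumed 'ent_prod' catalog.
managed_tbl = (
    "CREATE TABLE IF NOT EXISTS ent_prod.sales.orders "
    "(order_id BIGINT, amount DECIMAL(18,2), order_ts TIMESTAMP)"
    # No LOCATION clause: Unity Catalog places the files under the
    # catalog's managed location in the customer-owned ADLS account,
    # and manages their lifecycle (e.g. cleanup on DROP TABLE).
)

external_tbl = (
    "CREATE TABLE IF NOT EXISTS ent_prod.sales.orders_ext "
    "(order_id BIGINT, amount DECIMAL(18,2), order_ts TIMESTAMP) "
    "LOCATION 'abfss://prod@entdatalake.dfs.core.windows.net/sales/orders_ext'"
    # External table: you choose and manage the path and file lifecycle yourself.
)

print(managed_tbl)
print(external_tbl)
```

The trade-off in this sketch mirrors the advice above: managed tables hand storage layout and optimisation to Unity Catalog, while external tables keep the path under your control for tools that must read the files directly.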

I hope this helps. Let me know if anything doesn't make sense.

Here are a couple of docs with more info:

https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/best-practices

https://learn.microsoft.com/en-us/azure/databricks/lakehouse-architecture/deployment-guide/storage


Thanks,

Emma



LokeshChikuru
Databricks Partner

@emma_s 

Thanks for the update, and I appreciate the quick response.