We are currently setting up Azure Databricks for enterprise analytics and want to validate our storage architecture against Databricks best practices.
Today, we ingest data directly from external enterprise sources (Oracle DB, SQL Server, etc.) using Databricks connectors, and the data lands in the Databricks-managed storage account (DBFS / ADLS in the managed resource group) that was created along with the workspace.
We do not currently have a customer-owned external ADLS Gen2 configured as a centralized enterprise data lake.
I understand from Databricks documentation that managed storage is primarily intended for workspace/internal use (logs, libraries, temp data, internal tables), and that for production and enterprise data, Databricks recommends using a customer-owned ADLS Gen2 accessed via Unity Catalog external locations.
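For context, the pattern we understand the documentation to recommend looks roughly like the following sketch. The location name, credential name, container, and storage account below are placeholders for illustration, not our actual configuration, and this assumes a Unity Catalog storage credential has already been created for the ADLS Gen2 account:

```sql
-- Register a customer-owned ADLS Gen2 container as a Unity Catalog
-- external location (storage credential must already exist).
CREATE EXTERNAL LOCATION IF NOT EXISTS enterprise_lake
  URL 'abfss://raw@<storage-account>.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL enterprise_lake_cred)
  COMMENT 'Centralized enterprise data lake (customer-owned ADLS Gen2)';

-- Point a catalog's managed tables at that location instead of the
-- workspace's default (Databricks-managed) storage.
CREATE CATALOG IF NOT EXISTS enterprise
  MANAGED LOCATION 'abfss://raw@<storage-account>.dfs.core.windows.net/managed';
```

If this is the right direction, we would plan to migrate the data we have already ingested into such a location.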
My questions are:
- Is it recommended to store enterprise production data from external systems in Databricks managed storage (even when it is ADLS Gen2 with HNS enabled)?
- For enterprise-scale deployments with multiple workspaces and downstream consumers (ADF, Fabric), is a single customer-owned ADLS Gen2 the recommended system of record?
- Would Databricks consider managed storage an acceptable long-term data lake, or should it be treated strictly as workspace-scoped/internal storage?
Any clarification or confirmation from Databricks engineers or the community would be greatly appreciated.