2 weeks ago
Hi,
We’re currently running a Databricks installation with an on-premises HDFS file system. As we’re looking to adopt Unity Catalog, we’ve realized that our current HDFS setup has limited support and compatibility with Unity Catalog.
Our requirement: We need to keep all data on-premises (cannot move to cloud storage at this time).
Options we’re considering:
1. S3-compatible storage on-premises - Solutions like MinIO that we could host locally
2. Databricks Delta Sharing - Though we’re not entirely clear if this addresses our use case
Has anyone successfully implemented Unity Catalog with on-premises storage solutions? Any guidance from those who have navigated similar migrations would be greatly appreciated!
Thanks in advance for your help.
2 weeks ago
Unity Catalog does not natively support HDFS; its design assumes cloud object storage (such as S3, ADLS, or GCS) as the backing store for both managed and external tables. For organizations restricted to on-premises storage the situation is nuanced, but several real-world patterns are emerging for enabling Unity Catalog (and, by extension, Databricks data governance features) when native cloud storage is not an option.
Using solutions like MinIO, which provide an S3-compatible API but are hosted on-premises, is considered one of the most viable workarounds. Many users have successfully configured Unity Catalog "external locations" to point to MinIO endpoints.
You'll need to configure Unity Catalog's external locations to use the S3a protocol and point to your MinIO endpoint.
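As a hedged sketch of what that configuration can look like: the property names below are the standard Hadoop S3A connector settings that Spark picks up under the `spark.hadoop.` prefix, but the MinIO endpoint URL and credentials are placeholders you would replace with your own (ideally sourced from a secret scope rather than hard-coded).

```python
# Hypothetical S3A settings for pointing Spark at an on-prem MinIO endpoint.
# Endpoint and keys are placeholders, not real values.
minio_s3a_conf = {
    # Direct the S3A connector at the MinIO server instead of Amazon S3.
    "fs.s3a.endpoint": "https://minio.internal.example.com:9000",
    # MinIO deployments typically need path-style access
    # (https://host/bucket/key) rather than virtual-hosted-style URLs.
    "fs.s3a.path.style.access": "true",
    # Static credentials; in practice, pull these from a secret scope.
    "fs.s3a.access.key": "MINIO_ACCESS_KEY",
    "fs.s3a.secret.key": "MINIO_SECRET_KEY",
    # Use the simple provider so the connector does not attempt AWS
    # instance-profile or environment-variable lookups first.
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
}

def as_spark_conf_lines(conf: dict) -> list[str]:
    """Render the settings in spark-defaults.conf form with the Spark prefix."""
    return [f"spark.hadoop.{key} {value}" for key, value in sorted(conf.items())]

for line in as_spark_conf_lines(minio_s3a_conf):
    print(line)
```

The same key/value pairs can equally be set in a cluster's Spark config UI; the point is that S3A, not the deprecated s3n/s3 schemes, is the connector to target, and path-style access is the setting most often missed with MinIO.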
Unity Catalog itself does not directly check whether the endpoint is Amazon S3, as long as the storage supports the S3 API. However, some users report issues with authentication, path-style access, and edge cases depending on their MinIO configuration.
The best practice is to thoroughly test metastore operations (CREATE TABLE, INSERT, SELECT, etc.) and ensure Delta format tables work as expected.
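A minimal smoke test of those operations can be scripted. In this sketch the catalog, schema, bucket, and table names are all hypothetical, and the SQL executor is injected so the same sequence can run against a real cluster (via `spark.sql`) or be dry-run with a stub:

```python
# Hypothetical end-to-end smoke test for a Unity Catalog external location
# backed by MinIO: create a Delta table, round-trip a row, then clean up.
SMOKE_TEST_STATEMENTS = [
    "CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.uc_smoke "
    "(id INT, note STRING) USING DELTA LOCATION 's3://uc-test-bucket/uc_smoke'",
    "INSERT INTO demo_catalog.demo_schema.uc_smoke VALUES (1, 'hello from minio')",
    "SELECT * FROM demo_catalog.demo_schema.uc_smoke",
    "DROP TABLE demo_catalog.demo_schema.uc_smoke",
]

def run_smoke_test(run_sql):
    """Execute each statement in order; return the statements that ran.

    On Databricks, call run_smoke_test(lambda s: spark.sql(s)).
    A failing statement raises, aborting the sequence early.
    """
    executed = []
    for stmt in SMOKE_TEST_STATEMENTS:
        run_sql(stmt)
        executed.append(stmt)
    return executed
```

Running this against each external location before onboarding real workloads surfaces the authentication and path-style problems mentioned above while they are still cheap to fix.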
It is strongly recommended to use a recent version of MinIO and Databricks Runtime, as compatibility is evolving.
Delta Sharing is designed for secure sharing of Delta Lake data across organizations and platforms, not as a direct storage backend. You can set up a Delta Sharing server pointing at Delta tables stored on on-prem MinIO, but this is more about sharing data with other consumers, not about operationalizing Unity Catalog with on-prem data.
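For completeness, since Delta Sharing keeps coming up in this context: a recipient connects to a sharing server via a small profile file, not via storage credentials. A sketch of that profile follows; the endpoint and token are placeholders that a real (here, self-hosted) sharing server would issue.

```python
import json

# Hypothetical recipient profile for a self-hosted Delta Sharing server.
# Endpoint and bearer token are placeholders issued by the server operator.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://delta-sharing.internal.example.com/delta-sharing/",
    "bearerToken": "PLACEHOLDER_TOKEN",
}

# A recipient saves this as e.g. onprem.share and reads shared tables with
# the delta-sharing client library, without any access to the backing store.
profile_json = json.dumps(profile, indent=2)
print(profile_json)
```

This illustrates the point above: the recipient sees only the sharing endpoint, which is useful for distributing data outward but does nothing to make MinIO a governed storage backend for your own workspace.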
If your goal is central governance and access control on live data, Delta Sharing is a complement, not a replacement.
Multiple community threads and blog posts describe setting up Unity Catalog with MinIO or similar solutions, but caution that features like credential vending and audit logging may require extra validation.
Be aware of official support limitations: Databricks currently lists cloud object storage as the supported option, so on-prem S3-compatible solutions might not qualify for full production support.
Some integrators have attempted hybrid architectures, e.g., using Dremio or other middleware to bridge Unity Catalog and on-prem HDFS, but this introduces additional administrative complexity and is less commonly used for direct Databricks-Unity Catalog integration.
For your requirements, running MinIO on-premises and configuring it as an external S3 location is the most common and practical path today. Make sure network, authentication, and S3 API compatibility are robustly tested.
Unity Catalog with MinIO has been implemented in the field, but always check with Databricks support/your TAM for emerging compatibility updates.
Delta Sharing may play a role if you plan to securely share certain datasets with other teams or outside parties, but it does not address the core on-premises governance requirement you outlined.
In summary: MinIO (or an equivalent, S3-compatible object store) deployed on-premises is the most proven approach for enabling Unity Catalog features when you cannot use cloud storage, with several organizations reporting successful integrations, but expect to self-manage compatibility and support details.
Wednesday
Thanks very much for your detailed response — this is really helpful.
You mentioned cases where organizations have migrated from on-premises HDFS to Unity Catalog on Databricks; I'd love to learn more about those.
If possible, could you share links to any such case studies (white papers, blog posts, or success stories), or contacts I might reach out to (subject to NDA or referencing constraints)?