Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Migrating from on-premises HDFS to Unity Catalog - Looking for advice on on-prem options

Sergecom
New Contributor III

Hi,
We’re currently running a Databricks installation with an on-premises HDFS file system. As we’re looking to adopt Unity Catalog, we’ve realized that our current HDFS setup has limited support and compatibility with Unity Catalog.
Our requirement: We need to keep all data on-premises (cannot move to cloud storage at this time).
Options we’re considering:
1. S3-compatible storage on-premises - Solutions like MinIO that we could host locally
2. Databricks Delta Sharing - Though we’re not entirely clear if this addresses our use case

Has anyone successfully implemented Unity Catalog with on-premises storage solutions? Any guidance from those who have navigated similar migrations would be greatly appreciated!


Thanks in advance for your help.

1 ACCEPTED SOLUTION

mark_ott
Databricks Employee

Unity Catalog does not natively support HDFS; its design assumes cloud object storage (such as S3, ADLS, or GCS) as the backing store for both managed and external tables. For organizations restricted to on-premises storage the situation is nuanced, but several real-world patterns have emerged for enabling Unity Catalog (and, by extension, Databricks data governance features) when native cloud storage cannot be used.

S3-Compatible Storage On-Premises (e.g., MinIO)

  • Solutions like MinIO, which expose an S3-compatible API while being hosted on-premises, are considered one of the most viable workarounds. Many users have successfully configured Unity Catalog external locations to point at MinIO endpoints.

  • You'll need to configure Unity Catalog's external locations to use the S3A protocol and point to your MinIO endpoint.

  • Unity Catalog itself does not check whether the endpoint is actually Amazon S3, as long as the storage supports the S3 API. However, some users report issues with authentication, path-style access, and other edge cases depending on their MinIO configuration.

  • Best practice is to thoroughly test metastore operations (CREATE TABLE, INSERT, SELECT, etc.) and confirm that Delta-format tables behave as expected.

  • Use recent versions of MinIO and the Databricks Runtime, as compatibility is still evolving.
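As a rough sketch, the S3A wiring described above usually comes down to a handful of cluster-level Spark settings. All endpoint, bucket, and secret names below are placeholders for illustration, not a verified Databricks configuration:

```
# Hypothetical Spark config pointing the S3A connector at an on-prem MinIO
# endpoint. On Databricks these would typically go in the cluster's Spark
# config, with credentials kept in a secret scope rather than in plain text.
spark.hadoop.fs.s3a.endpoint https://minio.internal.example.com:9000
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.access.key {{secrets/minio/access_key}}
spark.hadoop.fs.s3a.secret.key {{secrets/minio/secret_key}}
```

Path-style access is worth calling out explicitly: MinIO deployments are often not reachable via virtual-hosted-style bucket DNS, which is one of the common edge cases mentioned above.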

Delta Sharing

  • Delta Sharing is designed for secure sharing of Delta Lake data across organizations and platforms, not as a direct storage backend. You can set up a Delta Sharing server pointing at Delta tables stored on on-prem MinIO, but this is more about sharing data with other consumers, not about operationalizing Unity Catalog with on-prem data.

  • If your goal is central governance and access control on live data, Delta Sharing is a complement, not a replacement.
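For context on the "server pointing at on-prem tables" pattern: the open-source Delta Sharing reference server is driven by a YAML file that maps share/schema/table names to storage paths. A minimal sketch, with hypothetical share and bucket names, might look like this:

```
# Hypothetical config for the OSS Delta Sharing reference server, exposing a
# Delta table stored on an on-prem S3-compatible bucket. Names are placeholders.
version: 1
shares:
- name: "onprem_share"
  schemas:
  - name: "analytics"
    tables:
    - name: "events"
      location: "s3a://onprem-bucket/delta/events"
```

Note that this only shares the data out; governance of the underlying tables still has to live somewhere else, which is why it complements rather than replaces Unity Catalog.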

Community Usage & Reports

  • Multiple community threads and blog posts describe setting up Unity Catalog with MinIO or similar solutions, but caution that features like credential vending and audit logging may require extra validation.

  • Be aware of official support limitations: Databricks currently lists cloud object storage as the supported option, so on-prem S3-compatible solutions might not qualify for full production support.

Direct HDFS and Hybrid Options

  • Some integrators have attempted hybrid architectures, e.g., using Dremio or other middleware to bridge Unity Catalog and on-prem HDFS, but this introduces additional administrative complexity and is less commonly used for direct Databricks-Unity Catalog integration.

Practical Steps & Recommendations

  • For your requirements, running MinIO on-premises and configuring it as an external S3 location is the most common and practical path today. Make sure network, authentication, and S3 API compatibility are robustly tested.

  • Unity Catalog with MinIO has been implemented in the field, but always check with Databricks support or your TAM for emerging compatibility updates.

  • Delta Sharing may play a role if you plan to securely share certain datasets with other teams or outside parties, but it does not address the core on-premises governance requirement you outlined.
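Concretely, the smoke test recommended earlier can be run as plain SQL once a storage credential exists. The credential, location, catalog, and bucket names below are placeholders, and the exact external-location setup for a non-AWS endpoint should be validated against current Databricks documentation:

```sql
-- Hypothetical smoke test; all object names are placeholders.
CREATE EXTERNAL LOCATION IF NOT EXISTS onprem_minio
  URL 's3://onprem-bucket/uc'
  WITH (STORAGE CREDENTIAL minio_cred);

CREATE TABLE main.sandbox.uc_smoke_test (id INT, note STRING)
  LOCATION 's3://onprem-bucket/uc/uc_smoke_test';
INSERT INTO main.sandbox.uc_smoke_test VALUES (1, 'hello');
SELECT * FROM main.sandbox.uc_smoke_test;  -- verify the round trip works
DROP TABLE main.sandbox.uc_smoke_test;
```

If any step fails, the usual suspects are the ones listed above: authentication, path-style access, and endpoint/SSL configuration on the MinIO side.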

In summary: MinIO (or an equivalent S3-compatible object store) deployed on-premises is the most proven approach for enabling Unity Catalog features when you cannot use cloud storage, with several organizations reporting successful integrations, but expect to self-manage compatibility and support details.


2 REPLIES


Sergecom
New Contributor III

Thanks very much for your detailed response — this is really helpful.
You mentioned cases where organizations have migrated from on-premises HDFS to Unity Catalog; I'd love to learn more about those.

If possible, could you share links to those client case studies (white papers, blog posts, or success stories), or contacts I might reach out to (subject to NDA / referencing constraints)?
