Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Registering Delta tables from external storage GCS , S3 , Azure Blob in Databricks Unity Catalog

muaaz
Visitor

Hi everyone,

I am currently working on a migration project from Azure Databricks to GCP Databricks, and I need some guidance from the community on best practices around registering external Delta tables into Unity Catalog.

Currently I am doing the following, but I am looking for a more optimized, best-practice approach:

CREATE TABLE IF NOT EXISTS catalog.schema.table
USING DELTA
LOCATION 'gs://bucket/tables/UUID';

amirabedhiafi
Visitor

 

Hello muaaz!

I don't think that's the best pattern to follow. I recommend avoiding registering tables directly from gs://... paths; instead, put the data under a UC external location backed by a storage credential. According to the docs, external table paths should live inside an existing external location, and it is better to organize this as one external location per schema for external table use cases.

Ideally, you should also use managed tables by default, unless you specifically need in-place access to existing data or direct access from non-Databricks tools.

So try:

 

CREATE TABLE catalog.schema.table
USING DELTA
LOCATION 'gs://my-bucket/external/schema_name/table_name';

with that location living inside a precreated UC external location such as:

CREATE EXTERNAL LOCATION ext_schema_name
URL 'gs://my-bucket/external/schema_name/'
WITH (STORAGE CREDENTIAL gcs_credential);

https://docs.databricks.com/gcp/en/connect/unity-catalog/cloud-storage

 

Ashwin_DSA
Databricks Employee

Hi @muaaz,

On GCP Databricks, the SQL pattern you are using is fine, but the recommended best practice is to back it with a Unity Catalog external location instead of pointing tables directly at arbitrary gs:// paths.

In practice, that means first creating a storage credential and an external location over the relevant GCS prefix (for example, gs://bucket/tables) and granting CREATE EXTERNAL TABLE on that external location. You then register each table with CREATE TABLE catalog.schema.table USING DELTA LOCATION 'gs://bucket/tables/UUID_or_table_path', where the path sits under that external location and is dedicated to a single table.

Databricks generally recommends using Unity Catalog managed tables for long-term workloads, so a common migration pattern is to initially register existing Delta data as external tables for a fast cutover, and then, once things are stable, convert those external tables to managed tables so that Unity Catalog controls the storage location, lifecycle, and automatic optimisations going forward.
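One way to sketch the external-to-managed conversion step described above is a deep clone into a new managed table (no LOCATION clause, so Unity Catalog owns the storage), followed by dropping the external registration once the clone is verified. This is only one possible approach, and the helper and names below are illustrative; on Databricks you would run each statement with spark.sql:

```python
def external_to_managed_ddl(catalog, schema, table, suffix="_managed"):
    """Build the statements for a deep-clone-based conversion of an
    external Delta table into a managed one. Returns the statements
    in execution order."""
    src = f"{catalog}.{schema}.{table}"
    dst = f"{catalog}.{schema}.{table}{suffix}"
    return [
        # No LOCATION clause: Unity Catalog picks the managed storage path.
        f"CREATE TABLE {dst} DEEP CLONE {src}",
        # Run only after the clone has been validated.
        f"DROP TABLE {src}",
    ]
```

The external table stays queryable until the final DROP, so the cutover window per table is small.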

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

muaaz
Visitor

Hi @Ashwin_DSA,

Thanks for the reply.

The method you proposed sounds fine, but we are dealing with a very large volume of data: around 3 schemas, ~50 tenants, and over 100 tables. Since this data is being migrated from Azure to GCP, we would prefer to avoid writing or managing scripts manually, especially as it may introduce additional validation overhead.

To give more context, the data has already been copied from Azure to GCP, and now we need to register the tables in the catalog.

I would really appreciate it if you could suggest an alternative approach that is more efficient, saves time, and minimizes manual involvement.

Ashwin_DSA
Databricks Employee

Hi @muaaz,

Given how much data you have and the fact that it's already in GCS, there unfortunately isn't a built-in "auto-discover and register all Delta tables" button in Unity Catalog. You always need some automation to translate the storage layout into table metadata.

The most time- and risk-efficient pattern I would recommend is to standardise once on a generic registration job, and then let that job do all the work for all schemas/tenants.

You define a clear directory convention on GCS (for example gs://bucket/<schema>/<tenant>/<table>), configure a single external location over the relevant root, and run one Databricks job/notebook that walks that directory tree, detects every _delta_log, infers the catalog/schema/table name from the path, and issues CREATE TABLE … USING DELTA LOCATION 'gs://…' automatically for each match.

That way you are not hand-writing or manually validating 100+ DDLs. Your only manual input is the base path and a small mapping (for example which UC catalog each source schema should land in), and the rest is repeatable and idempotent.
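A minimal local sketch of that discovery logic, assuming the <schema>/<tenant>/<table> layout (the function name and the tenant-to-table-name mapping are illustrative; on Databricks you would list the gs:// paths, e.g. with dbutils.fs.ls, and execute each generated statement with spark.sql):

```python
import os

def discover_delta_tables(root, catalog):
    """Walk a directory tree, treat every directory containing a
    _delta_log folder as a Delta table, and generate a CREATE TABLE
    statement for each, inferring names from root/<schema>/<tenant>/<table>."""
    ddls = []
    for dirpath, dirnames, _ in os.walk(root):
        if "_delta_log" in dirnames:
            rel = os.path.relpath(dirpath, root)
            parts = rel.split(os.sep)   # expected: [schema, tenant, table]
            if len(parts) != 3:
                continue                # skip paths outside the convention
            schema, tenant, table = parts
            ddls.append(
                f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{tenant}_{table} "
                f"USING DELTA LOCATION '{dirpath}'"
            )
            dirnames.remove("_delta_log")  # do not descend into the log itself
    return sorted(ddls)
```

Because every statement is CREATE TABLE IF NOT EXISTS, the job is idempotent and safe to rerun after partial failures.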

As an alternative for future migrations where you want to avoid dealing with paths at all, you can keep the source on Azure as the system of record, expose those tables via Delta Sharing, and on GCP Databricks materialize them directly as managed Unity Catalog tables. That shifts the problem from "register many external Delta directories" to a controlled, table-driven replication flow without custom per-table scripts. Appreciate this may not work in all scenarios but worth considering.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

Alright, thank you so much for your support @Ashwin_DSA. It was really helpful.

I will definitely mark it as "Accepted as Solution." I truly appreciate your valuable time.