Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Move/Merge Catalogs

IONA
New Contributor III

Hi

We have a couple of Unity catalogs. In each are schemas, some used, some old, some named temp_data and so on. Within those schemas are tables with the same erratic approach - the result of a couple of years of bad housekeeping. (Who hasn't experienced that before?)

We now have an initiative to tidy all this up as we merge a couple of departments.

The plan is to create a couple of new catalogs and new schemas within those, then cherry pick the good stuff from the old mess. The first stage of this operation has been to identify the tables we want to keep and set out what the new schemas will be called.

The big question in front of us is just how best to go about porting the old stuff to the new destination. As described, it's not a simple 1:1 lift and shift; table "xxx" may well become table "abc" and live in a different schema than its previous home.

The data in the old schemas is live and will continue to be updated while the new catalogs are created. There is a requirement to keep the old and new in sync. Maybe not instant synchronization like database replication might give, but perhaps the new is brought up to date by a refresh every day - or whenever we instruct it to do so!

By now it will probably be obvious to you that at some point all the notebooks and reports that use the current data will be switched to point at the new locations. Maybe this will be achieved by a migration branch of the code in git with all the reference paths pointing to the new tidy catalogs, and, at the requisite "go go go" point in time, the new tables will be brought up to date, we'll merge the migration branch into master, and the new world will be born. (The old catalogs will rust and die, eventually to be deleted.)

So, I'm seeking advice on the best strategy to do all this. I could embark on a roll-my-own method by scripting out the tables, tweaking the scripts to the new names and creating the new tables, then producing a set of SQL statements that first truncate the new tables, select from the old and insert into the new in order to refresh the data.
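Roughly along these lines for each table (names invented just to illustrate):

-- Full refresh of one renamed table (illustrative names only)
TRUNCATE TABLE new_catalog.new_schema.abc;

INSERT INTO new_catalog.new_schema.abc
SELECT * FROM old_catalog.old_schema.xxx;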

Although, with care, this would work, it feels clunky and error prone. So... my big question and pivotal point of learning is...

Is there a more elegant and "Databricks approved/advised" approach to take?

Something where I would set a configuration file/table to instruct:

"Databricks... please replicate this table from here to here, and keep the two in sync daily at this time, until I instruct you to stop."

Please help before I roll my own solution and then find out there was a better way all along!

 

2 REPLIES

BigRoux
Databricks Employee

For moving tables and entire schemas, consider looking at the SYNC command.

 

The SYNC Command - Your Primary Migration Tool
Databricks provides a SYNC command specifically designed for this exact use case - migrating tables from one location to another while maintaining synchronization. This command can upgrade tables from Hive Metastore to Unity Catalog and can be used for ongoing synchronization between catalogs.
 
Key Features of SYNC:
• Schema-level or table-level operations: You can sync entire schemas or individual tables 
• Automated synchronization: Once synced, the command maintains references between source and target tables 
• Dry run capability: Test migrations before executing with `DRY RUN` parameter 
• Flexible naming: Tables can be renamed during the sync process 
 
Example Usage:
 

-- Sync individual table with new name/location
SYNC TABLE new_catalog.new_schema.abc FROM old_catalog.old_schema.xxx;

-- Sync entire schema
SYNC SCHEMA new_catalog.tidy_schema FROM old_catalog.messy_schema;

-- Test before executing
SYNC TABLE new_catalog.new_schema.abc FROM old_catalog.old_schema.xxx DRY RUN;

There are also some automated refresh strategies you can consider:

1. Materialized Views with Scheduled Refresh (see the sketch below)

2. Lakeflow Streaming Tables

3. Synced Tables (currently in Preview)
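
For option 1, a minimal sketch of a scheduled materialized view (placeholder names; assumes a Databricks SQL materialized view with a Quartz cron schedule):

-- Hypothetical names; refreshes daily at 02:00 on a cron schedule
CREATE MATERIALIZED VIEW new_catalog.new_schema.abc
  SCHEDULE CRON '0 0 2 * * ?'
  AS SELECT * FROM old_catalog.old_schema.xxx;

-- On-demand refresh, e.g. at cutover time
REFRESH MATERIALIZED VIEW new_catalog.new_schema.abc;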

 

Some helpful links: 

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-aux-sync

https://community.databricks.com/t5/technical-blog/planning-a-migration-to-unity-catalog/ba-p/50759

https://www.databricks.com/glossary/materialized-views

https://docs.databricks.com/aws/en/dlt/dbsql/materialized

https://docs.databricks.com/aws/en/dlt/streaming-tables

https://docs.databricks.com/aws/en/dlt/flows

 

These recommendations are just that, so depending on the circumstances of your situation some of them may not apply. However, the goal here is to let you know that there are options, and you will need to dig in to see which fits your situation.

Hope this helps, Louis.

 

 

kyla598ellis
Visitor


Hello @IONA,
A good approach is to use Databricks’ Delta Live Tables or Auto Loader with scheduled jobs. These let you define pipelines to copy/transform data from old schemas to the new ones, keep them refreshed on a schedule, and reduce manual scripting. This way you get automation, consistency, and easier maintenance compared to a custom roll-your-own process.
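
For example, a streaming table can incrementally pull changes from the old table on a schedule - a minimal sketch (placeholder names; assumes the old table only receives appends, since updates and deletes need extra handling):

-- Incrementally ingests new rows from the old table on each refresh
CREATE OR REFRESH STREAMING TABLE new_catalog.new_schema.abc
  AS SELECT * FROM STREAM(old_catalog.old_schema.xxx);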

Best Regards,
Kyla Ellis

